CN115114283A


Info

Publication number: CN115114283A
Application number: CN202210576032.0A
Authority: CN
Inventors: 张雨春
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2022-05-25
Filing date: 2022-05-25
Publication date: 2022-09-27
Anticipated expiration: 2042-05-25
Also published as: CN115114283B

Abstract

The application discloses a data processing method, a data processing apparatus, a computer-readable medium, and an electronic device. The method comprises: performing hash operation processing on each data item to be processed in a target scene using at least one hash algorithm, to obtain at least one hash value for each data item; superimposing each data item onto the data stored in a preset data list at the position corresponding to the hash algorithm and the hash value, to obtain superimposed data for each hash algorithm; sorting the superimposed data for each hash algorithm in the preset data list, and obtaining target superimposed data and non-target data according to the sorting result; and calculating, from the target superimposed data and the non-target data, the data among the data items to be processed that matches a set indicator. In this technical solution, the data to be processed is converted from data in its original value space into data that occupies less storage space, which greatly reduces the storage required during data processing and saves the resources needed for indicator calculation.

Description

Translated from Chinese

Data processing method, apparatus, computer-readable medium, and electronic device

Technical Field

The present application belongs to the technical field of data processing, and specifically relates to a data processing method, an apparatus, a computer-readable medium, and an electronic device.

Background

With the advent of the big data era, the amount of raw data that must be processed for data analysis keeps growing. Raw data is generally recorded item by item in order of generation time. During data processing, the raw data is fetched from the corresponding storage locations for calculation, for example accumulated according to certain conditions, and the calculation results are then stored as well for later use. However, the more raw data there is, the more storage space this item-by-item approach occupies, and the more storage locations must be accessed during processing, which may reduce data processing efficiency.

It should be noted that the information disclosed in the Background section above is only intended to aid understanding of the background of the present application, and may therefore include information that does not constitute prior art known to a person of ordinary skill in the art.

Summary of the Invention

The purpose of the present application is to provide a data processing method, an apparatus, a computer-readable medium, and an electronic device that address the problem in the related art that data processing occupies a large amount of storage space.

Other features and advantages of the present application will become apparent from the following detailed description, or may be learned in part by practice of the present application.

According to one aspect of the embodiments of the present application, a data processing method is provided, comprising:

performing hash operation processing on each data item to be processed in a target scene using at least one hash algorithm, to obtain at least one hash value corresponding to each data item to be processed;

according to the at least one hash value corresponding to the data to be processed, superimposing the data to be processed onto the data in a preset data list at the position corresponding to the hash algorithm and the hash value, to obtain superimposed data corresponding to each hash algorithm;

sorting the superimposed data corresponding to each hash algorithm in the preset data list, obtaining, according to the sorting result, target superimposed data that corresponds to each hash algorithm and matches a set indicator, and generating non-target data corresponding to each hash algorithm according to the superimposed data, among the superimposed data corresponding to that hash algorithm, other than the target superimposed data;

calculating, according to the target superimposed data and the non-target data, the data among the plurality of data items to be processed that matches the set indicator, so as to obtain a data processing result corresponding to the target scene.

According to one aspect of the embodiments of the present application, a data processing apparatus is provided, comprising:

a hash operation module, configured to perform hash operation processing on each data item to be processed in a target scene using at least one hash algorithm, to obtain at least one hash value corresponding to each data item to be processed;

a data superimposing module, configured to superimpose, according to the at least one hash value corresponding to the data to be processed, the data to be processed onto the data in a preset data list at the position corresponding to the hash algorithm and the hash value, to obtain superimposed data corresponding to each hash algorithm;

a data calculation module, configured to sort the superimposed data corresponding to each hash algorithm in the preset data list, obtain, according to the sorting result, target superimposed data that corresponds to each hash algorithm and matches a set indicator, and generate non-target data corresponding to each hash algorithm according to the superimposed data, among the superimposed data corresponding to that hash algorithm, other than the target superimposed data;

a result generation module, configured to calculate, according to the target superimposed data and the non-target data, the data among the plurality of data items to be processed that matches the set indicator, so as to obtain a data processing result corresponding to the target scene.

In an embodiment of the present application, the data to be processed includes data in the form of key-value pairs; the hash operation module is specifically configured to perform hash operation processing on the key of each data item to be processed in the target scene using at least one hash algorithm;

the data superimposing module is specifically configured to superimpose, according to the at least one hash value corresponding to the data to be processed, the value of the data to be processed onto the data in the preset data list at the position corresponding to the hash algorithm and the hash value.

In an embodiment of the present application, the hash algorithm includes a hash function operation and a modulo operation; the hash operation module includes:

a hash calculation unit, configured to perform hash calculation on the key of each data item to be processed in the target scene using at least one hash function operation, to obtain at least one hash result for each data item to be processed;

a modulo operation unit, configured to take the at least one hash result of each data item to be processed modulo a preset hash bucket count, and to use the result of the modulo operation as the at least one hash value corresponding to that data item; the preset hash bucket count indicates the size of the storage space occupied by the preset data list.
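The two units above can be sketched as follows. This is a minimal illustration, not the implementation described in the application: the choice of BLAKE2b, the use of a seed to derive several hash functions, and the names `bucket_index` and `hash_values` are all assumptions made for the example.

```python
import hashlib


def bucket_index(key: str, seed: int, num_buckets: int) -> int:
    """Hash the key (hash function operation), then reduce it modulo the bucket count."""
    # Assumption: one seeded BLAKE2b instance stands in for each hash function.
    digest = hashlib.blake2b(
        key.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    hash_result = int.from_bytes(digest, "little")
    # Modulo operation: the hash value is a position in [0, num_buckets).
    return hash_result % num_buckets


def hash_values(key: str, num_hashes: int, num_buckets: int) -> list[int]:
    """Return one hash value per hash algorithm for the given key."""
    return [bucket_index(key, seed, num_buckets) for seed in range(num_hashes)]
```

Because the hash value is always smaller than the preset hash bucket count, the storage needed per hash algorithm is bounded by that count regardless of how many distinct keys occur.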

In an embodiment of the present application, the superimposed data corresponding to the hash algorithm includes superimposed data stored in a plurality of hash buckets, where one hash bucket represents one storage location in the preset data list; the data calculation module includes:

a target superimposed data generation unit, configured to obtain, according to the sorting result, the superimposed data stored in a set number of hash buckets that correspond to each hash algorithm and match the set indicator, and to sum the superimposed data stored in the set number of hash buckets as the target superimposed data corresponding to each hash algorithm.
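As a rough sketch of this unit for a single hash algorithm: the buckets are sorted, the first `k` are summed as the target superimposed data, and the remainder is summed separately. Descending order and the name `split_buckets` are assumptions chosen to suit a "top k" indicator; the application itself only says the selection follows the sorting result.

```python
def split_buckets(buckets: list[float], k: int) -> tuple[float, float]:
    """Return (target superimposed data, sum of the other superimposed data)."""
    # Assumption: a "top k" indicator, so buckets are ranked largest first.
    ranked = sorted(buckets, reverse=True)
    target_sum = sum(ranked[:k])
    other_sum = sum(ranked[k:])
    return target_sum, other_sum


# Five buckets, indicator "top three": 350 + 90 + 20 and 10 + 5.
target, other = split_buckets([5, 350, 20, 90, 10], 3)
```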

In an embodiment of the present application, the data calculation module includes:

a value expectation calculation unit, configured to generate a value expectation of the non-target data according to the superimposed data, among the superimposed data corresponding to the hash algorithm, other than the target superimposed data, and according to the amount of data to be processed that corresponds to that other superimposed data;

a quantity expectation calculation unit, configured to calculate a quantity expectation of the non-target data according to the amount of data to be processed that corresponds to the target superimposed data;

a non-target data calculation unit, configured to generate the non-target data corresponding to the hash algorithm according to the product of the value expectation of the non-target data and the quantity expectation of the non-target data.

In an embodiment of the present application, the value expectation calculation unit includes:

a non-target superimposed data generation subunit, configured to sum the superimposed data, among the superimposed data corresponding to the hash algorithm, other than the target superimposed data, to obtain the non-target superimposed data corresponding to the hash algorithm;

a data deduplication subunit, configured to determine the deduplicated count of the data to be processed that corresponds to the other superimposed data, according to the amount of data to be processed that corresponds to the other superimposed data;

a value expectation calculation subunit, configured to obtain the value expectation of the non-target data as the ratio of the non-target superimposed data to the deduplicated count of the data to be processed that corresponds to the other superimposed data.

In an embodiment of the present application, the data deduplication subunit is specifically configured to:

deduplicate the plurality of data items to be processed in the target scene by key, to obtain the total deduplicated count of the data to be processed;

deduplicate the plurality of data items to be processed that correspond to the target superimposed data by key, to obtain the deduplicated count of the data to be processed that corresponds to the target superimposed data;

obtain the deduplicated count of the data to be processed that corresponds to the other superimposed data as the difference between the total deduplicated count and the deduplicated count corresponding to the target superimposed data.

In an embodiment of the present application, the quantity expectation calculation unit includes:

a first calculation subunit, configured to generate a quantity expectation of the target superimposed data according to the preset hash bucket count, the total deduplicated count of the data to be processed corresponding to the target scene, and a preset fitting function;

a second calculation subunit, configured to obtain the quantity expectation of the non-target data as the difference between the quantity expectation of the target superimposed data and the value of the set indicator.
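The arithmetic described by the units above can be put together in a short sketch. The function and parameter names are illustrative assumptions, and the quantity expectation of the target superimposed data is passed in as a number rather than computed, since the form of the preset fitting function is not given here.

```python
def non_target_estimate(
    other_sum: float,              # sum of the superimposed data outside the target buckets
    total_distinct: int,           # total deduplicated count in the target scene
    target_distinct: int,          # deduplicated count behind the target superimposed data
    expected_target_count: float,  # quantity expectation from the preset fitting function
    k: int,                        # value of the set indicator (for example, 3 for "top three")
) -> float:
    """Non-target data = value expectation x quantity expectation."""
    other_distinct = total_distinct - target_distinct
    if other_distinct <= 0:
        raise ValueError("no data outside the target buckets")
    value_expectation = other_sum / other_distinct
    quantity_expectation = expected_target_count - k
    return value_expectation * quantity_expectation


# Illustrative numbers only: 1500 / (105 - 5) = 15 per key, and 5.0 - 3 = 2 extra keys.
estimate = non_target_estimate(1500, 105, 5, 5.0, 3)
```

Read this way, the non-target data estimates how much of the target superimposed data was contributed by keys that merely collided into the selected buckets.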

In an embodiment of the present application, the quantity expectation calculation unit further includes:

a fitting function construction unit, configured to construct, using undetermined fitting coefficients, a fitting function of the hash bucket count and the total deduplicated count of the data to be processed;

a training unit, configured to train the fitting function on sample data to obtain target values of the undetermined fitting coefficients, the sample data including a sample hash bucket count and the total deduplicated count of the sample data corresponding to the target scene;

a preset fitting function generation unit, configured to generate the preset fitting function according to the target values of the undetermined fitting coefficients.

In an embodiment of the present application, the training unit is specifically configured to:

randomly generate initial values of the undetermined fitting coefficients;

evaluate the fitting function, with the undetermined fitting coefficients set to the initial values, on the sample data to obtain a predicted quantity expectation for the sample data;

adjust the values of the undetermined fitting coefficients according to the difference between the predicted quantity expectation and the actual quantity expectation of the sample data, until the difference is smaller than a preset threshold, to obtain the target values of the undetermined fitting coefficients.
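The training loop can be sketched as below, under stated assumptions. The text does not give the form of the fitting function or the adjustment rule, so the sketch assumes a linear form `a * (d / m) + b` in the ratio of total deduplicated count `d` to bucket count `m`, and uses gradient descent on the mean squared error as the adjustment. The sample values are synthetic.

```python
import random


def train_fit(samples, lr=0.05, threshold=1e-4, max_steps=100_000, seed=0):
    """Fit a and b in g(m, d) = a * (d / m) + b to (m, d, actual_expectation) samples."""
    rng = random.Random(seed)
    # Step 1: random initial values for the undetermined coefficients.
    a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    n = len(samples)
    for _ in range(max_steps):
        # Step 2: predicted quantity expectation minus the actual one.
        errors = [(a * (d / m) + b) - y for m, d, y in samples]
        # Step 3: stop once every difference is below the preset threshold.
        if max(abs(e) for e in errors) < threshold:
            break
        # Assumed adjustment rule: one gradient descent step on the mean squared error.
        grad_a = 2 * sum(e * (d / m) for e, (m, d, _) in zip(errors, samples)) / n
        grad_b = 2 * sum(errors) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


# Synthetic samples generated from 1.5 * (d / m) + 3, for illustration only.
samples = [(100, 50, 3.75), (100, 100, 4.5), (100, 200, 6.0), (100, 400, 9.0)]
a, b = train_fit(samples)
```

Any other fitting procedure that drives the difference below the threshold would fit the description equally well; the linear form is chosen only to keep the example small.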

In an embodiment of the present application, the result generation module includes:

a data processing unit, configured to obtain, as the difference between the target superimposed data and the non-target data corresponding to each hash algorithm, the data matching the set indicator among the plurality of data items to be processed for that hash algorithm;

a statistics unit, configured to perform statistical processing on the data matching the set indicator obtained for each hash algorithm, to obtain the data processing result corresponding to the target scene.

In an embodiment of the present application, the statistics unit is specifically configured to:

calculate the expected value of the data matching the set indicator obtained for each hash algorithm, and use it as the data processing result corresponding to the target scene.
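A minimal sketch of the result generation step follows, assuming the expected value is taken as the arithmetic mean over the hash algorithms. That reading, the function name, and the numbers are assumptions for illustration.

```python
from statistics import mean


def combine(per_hash: list[tuple[float, float]]) -> float:
    """per_hash holds (target superimposed data, non-target data) for each hash algorithm."""
    # Per algorithm: matching data = target superimposed data - non-target data.
    estimates = [target - non_target for target, non_target in per_hash]
    # Assumption: the expected value across algorithms is their arithmetic mean.
    return mean(estimates)


# Three hash algorithms with illustrative values: estimates 430, 436, and 430.
result = combine([(460, 30), (470, 34), (455, 25)])
```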

According to one aspect of the embodiments of the present application, a computer-readable medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements the data processing method of the above technical solution.

According to one aspect of the embodiments of the present application, an electronic device is provided, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor executes the executable instructions to cause the electronic device to perform the data processing method of the above technical solution.

According to one aspect of the embodiments of the present application, a computer program product or computer program is provided, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the data processing method of the above technical solution.

In the technical solution provided by the embodiments of the present application, each data item to be processed in the target scene is hashed using at least one hash algorithm to obtain at least one hash value for that item, and the data to be processed is stored in superimposed form according to the hash algorithm and the hash value. The hash algorithm converts the data to be processed from data in its original value space into data that occupies less storage space, which greatly reduces the storage required during data processing. The superimposed data corresponding to each hash algorithm is then sorted; target superimposed data matching the set indicator is obtained from the sorting result, and non-target data is generated from the superimposed data other than the target superimposed data. Finally, the data to be processed that matches the set indicator is obtained from the target superimposed data and the non-target data, and the data processing result is generated. In effect, large-scale data is processed and the corresponding indicator data is obtained using a small amount of computing resources, which saves the resources needed for indicator calculation.

It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and do not limit the present application.

Description of the Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain its principles. The drawings described below are evidently only some embodiments of the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1A schematically shows a block diagram of an exemplary system architecture to which the technical solution of the present application is applied.

FIG. 1B schematically shows an application scenario of the technical solution of the present application.

FIG. 1C schematically shows another application scenario of the technical solution of the present application.

FIG. 2 schematically shows a flowchart of a data processing method provided by an embodiment of the present application.

FIG. 3 schematically shows a flowchart of a data processing method provided by an embodiment of the present application.

FIG. 4 schematically shows a preset data list provided by an embodiment of the present application.

FIG. 5 schematically shows a sorting result of superimposed data provided by an embodiment of the present application.

FIG. 6 schematically shows a graph of constructing the preset fitting function by linear fitting, provided by an embodiment of the present application.

FIG. 7 schematically shows a structural block diagram of a data processing apparatus provided by an embodiment of the present application.

FIG. 8 schematically shows a structural block diagram of a computer system suitable for implementing the electronic device of an embodiment of the present application.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments can, however, be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that the present application will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of the embodiments of the present application. Those skilled in the art will appreciate, however, that the technical solutions of the present application may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so on. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail, to avoid obscuring aspects of the present application.

The block diagrams shown in the drawings are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

The flowcharts shown in the drawings are only illustrative; they need not include all contents and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be split, and others may be combined or partially combined, so the actual order of execution may change according to the actual situation.

As shown in FIG. 1A, the system architecture 100 may include a terminal device 110, a network 120, and a server 130. The terminal device 110 may include a smartphone, a tablet computer, a notebook computer, a smart voice interaction device, a smart home appliance, a vehicle-mounted terminal, and the like. The server 130 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services. The network 120 may be a communication medium of any connection type capable of providing a communication link between the terminal device 110 and the server 130, for example a wired or wireless communication link.

Depending on implementation needs, the system architecture in the embodiments of the present application may have any number of terminal devices, networks, and servers. For example, the server 130 may be a server group composed of multiple server devices. In addition, the technical solutions provided in the embodiments of the present application may be applied to the terminal device 110, to the server 130, or implemented jointly by the terminal device 110 and the server 130; the present application places no particular limitation on this.

In an embodiment of the present application, the data processing method provided by the embodiments is implemented by the server 130. Specifically, the server 130 performs hash operation processing on each data item to be processed in the target scene using at least one hash algorithm, to obtain at least one hash value corresponding to each data item; one data item passed through one hash algorithm yields one hash value, and the data to be processed may be sent to the server 130 by the terminal device 110. Then, according to the at least one hash value corresponding to the data to be processed, the server 130 superimposes the data to be processed onto the data in the preset data list at the position corresponding to the hash algorithm and the hash value, to obtain the superimposed data corresponding to each hash algorithm; that is, the data to be processed is stored in the preset data list in superimposed form. Next, the server 130 sorts the superimposed data corresponding to each hash algorithm in the preset data list, obtains, according to the sorting result, the target superimposed data that corresponds to each hash algorithm and matches the set indicator, and generates the non-target data corresponding to each hash algorithm according to the superimposed data other than the target superimposed data. The set indicator is usually used to limit the range of data selected; for example, it may specify that the top five items in the ranking are selected. Finally, the server 130 calculates, according to the target superimposed data and the non-target data, the data among the plurality of data items to be processed that matches the set indicator, so as to obtain the data processing result corresponding to the target scene. In effect, the technical solution of the present application computes on the superimposed stored form of the data to be processed in order to recover the data in the original data to be processed that matches the set indicator. The server 130 may send the data processing result to the terminal device 110 so that the terminal device 110 can display it visually.

In an embodiment of the present application, the implementation of the technical solution is described taking a virtual resource transfer scenario as the target scene, with the set indicator being the resource transfer amounts of the three accounts with the highest total virtual resource transfers. In the virtual resource transfer scenario, one data item to be processed represents one transfer record, that is, the amount of virtual resources transferred by one account in one settlement, and can be expressed as (account ID, resource transfer amount). For example, if an account's ID is 123 and the resource transfer amount is 100, the data item to be processed is (123, 100).

In the virtual resource transfer scenario, one account may perform multiple settlements, that is, have multiple transfer records, so one account may correspond to multiple data items to be processed. For example, the data to be processed for a given account includes (account ID, resource transfer amount 1), (account ID, resource transfer amount 2), (account ID, resource transfer amount 3), and so on. The data to be processed in the virtual resource transfer scenario is denoted as (account ID 1, resource transfer amount 1), (account ID 2, resource transfer amount 2), …, (account ID n, resource transfer amount n), where the number attached to each account ID serves mainly to distinguish transfer records; account IDs with different numbers may in fact be the same account ID. By way of example, FIG. 1B schematically shows an application scenario of the technical solution of the present application. As shown in FIG. 1B, account 1 transfers some virtual resources to account 2, and this transfer forms the transfer record (account ID 1, resource transfer amount 1). Account 1 also transfers some virtual resources to account 3, and this transfer forms the transfer record (account ID 2, resource transfer amount 2). Both transfer records are data to be processed. Although the two records are labeled account ID 1 and account ID 2, these are the same account ID: both are the account ID of account 1.

First, the server 130 performs hash operation processing on each to-be-processed data item in the target scenario using at least one hash algorithm, obtaining at least one hash value for each item. Denoting a hash algorithm by f(x), hashing a to-be-processed data item can be written as f(account identifier) = hash value. If the hash algorithms are f₁(x), f₂(x), …, f_n(x), then for the to-be-processed data (account identifier 1, resource transfer amount 1), f₁(account identifier 1) = hash value 1, f₂(account identifier 1) = hash value 2, …, f_n(account identifier 1) = hash value n. In other words, applying one hash algorithm once to a data item yields one hash value, so a data item processed by at least one hash algorithm yields at least one hash value. Exemplarily, as shown in FIG. 1B, the to-be-processed data (account identifier 1, resource transfer amount 1) is processed by the hash algorithms f₁(x), f₂(x), …, f_n(x) to obtain n hash values, and the to-be-processed data (account identifier 2, resource transfer amount 2) is likewise processed by f₁(x), f₂(x), …, f_n(x) to obtain n hash values.
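The hashing step above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the text does not name concrete hash algorithms, so the family f₁…f_n is assumed here to be SHA-256 salted with the algorithm number, and `make_hash_family` is a hypothetical helper.

```python
import hashlib

def make_hash_family(n):
    """Return n distinct hash functions f_1..f_n (assumed: salted SHA-256)."""
    def make(j):
        def f(key):
            digest = hashlib.sha256(f"{j}:{key}".encode("utf-8")).digest()
            return int.from_bytes(digest[:8], "big")
        return f
    return [make(j) for j in range(1, n + 1)]

hash_family = make_hash_family(3)

# Two transfer records of the same account (identifier 123).
record_1 = ("123", 100)  # (account identifier 1, resource transfer amount 1)
record_2 = ("123", 50)   # (account identifier 2, resource transfer amount 2)

# Only the account identifier is hashed; each algorithm yields one hash value.
values_1 = [f(record_1[0]) for f in hash_family]
values_2 = [f(record_2[0]) for f in hash_family]
```

Because both records carry the same account identifier, every algorithm maps them to the same hash value, which is what later allows their amounts to accumulate at one position.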

Then, according to the at least one hash value of each to-be-processed data item, the server 130 superimposes the item onto the data stored at the position in the preset data list that corresponds to the hash algorithm and the hash value, obtaining the superimposed data for each hash algorithm. Exemplarily, as shown in FIG. 1B, the hash value obtained for the to-be-processed data (account identifier 1, resource transfer amount 1) through hash algorithm f₁(x) is 1, so resource transfer amount 1 is superimposed at the position indicated by (1, 1) in the preset data list. Assuming the data stored at (1, 1) is 200 and resource transfer amount 1 is 100, superimposing means adding the stored data 200 and the resource transfer amount 100, giving superimposed data of 300. For the to-be-processed data (account identifier 2, resource transfer amount 2), as shown in FIG. 1B, account identifier 2 is actually equal to account identifier 1, so f₁(account identifier 2) = f₁(account identifier 1) = 1. Resource transfer amount 2 therefore has the same storage position as resource transfer amount 1 and is also superimposed at the position indicated by (1, 1). Assuming the data at (1, 1) is 300 after resource transfer amount 1 is stored and resource transfer amount 2 is 50, the superimposed data at (1, 1) becomes 350 once resource transfer amount 2 is superimposed.
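The arithmetic of this example can be replayed in a few lines. This is an illustrative sketch only: the dictionary-based table and the `superimpose` helper are assumptions made for the example, not structures named in the application.

```python
from collections import defaultdict

# Preset data list modelled as a mapping from (row, column) to stored data.
table = defaultdict(int)
table[(1, 1)] = 200  # data already stored at position (1, 1)

def superimpose(table, row, col, amount):
    """Add amount to the data stored at (row, col) and return the new total."""
    table[(row, col)] += amount
    return table[(row, col)]

# f1(account identifier 1) = f1(account identifier 2) = 1, so both amounts
# land at position (1, 1).
after_first = superimpose(table, 1, 1, 100)   # resource transfer amount 1
after_second = superimpose(table, 1, 1, 50)   # resource transfer amount 2
```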

Next, the server 130 sorts the superimposed data corresponding to each hash algorithm in the preset data list, obtains from the sorting result the target superimposed data that corresponds to each hash algorithm and matches the set indicator, and generates the non-target data for each hash algorithm from the superimposed data other than the target superimposed data. Continuing the example in which the set indicator is the resource transfer amounts of the three accounts with the largest total virtual resource transfers, the target superimposed data is generated from the three largest superimposed data values of each hash algorithm. The target superimposed data contains the resource transfer amounts of the top three accounts as well as resource transfer amounts of accounts outside the top three. The non-target data is generated from the superimposed data other than the three largest values and contains resource transfer amounts of accounts outside the top three.

Finally, the server 130 calculates, from the target superimposed data and the non-target data, the data among the multiple to-be-processed data items that matches the set indicator, thereby obtaining the data processing result for the target scenario. Subtracting the non-target data from the target superimposed data yields the resource transfer amounts of the three accounts with the largest total virtual resource transfers, which is the data processing result. Exemplarily, as shown in FIG. 1B, the target superimposed data and the non-target data are obtained from the superimposed data stored in the preset data list, and the data processing result is then generated from them.

In one embodiment of the present application, the target scenario may also be a network access scenario, and the to-be-processed data may be a record of a website and its number of visits, expressed for example as (website identifier, number of visits). Exemplarily, if the IP (Internet Protocol) address of a website is 1.2.3.4 and the number of visits is 50, the to-be-processed data is expressed as (1.2.3.4, 50). FIG. 1C schematically shows an application scenario of the technical solution of the present application. As shown in FIG. 1C, visits to each website are counted per day. For website 1, the counts of two days form the to-be-processed data (website identifier 1, number of visits 1) and (website identifier 2, number of visits 2). Likewise, for website 2, the counts of two days form the to-be-processed data (website identifier 3, number of visits 3) and (website identifier 4, number of visits 4). Although website identifier 1 and website identifier 2 are numbered differently, they carry the same identification information, namely that of website 1 (for example, the IP address of website 1). Similarly, website identifier 3 and website identifier 4 are numbered differently but both carry the identification information of website 2 (for example, the IP address of website 2).

Suppose the set indicator is the five websites with the most total visits. The server 130 then processes the website visit data according to the data processing method provided by the embodiments of the present application and obtains the data of the five websites with the most total visits. For the specific implementation, refer to the description of the corresponding process in the virtual resource transfer scenario or to the subsequent embodiments; details are not repeated here.

In one embodiment of the present application, the target scenario may also be an information recommendation scenario, and the to-be-processed data may be a record of a piece of information and its number of clicks, expressed for example as (information identifier, number of clicks). Exemplarily, if a piece of information has identifier 111 and has been clicked 10 times, the to-be-processed data is expressed as (111, 10). Suppose the set indicator is the ten pieces of information with the most total clicks. The server 130 then processes the click data according to the data processing method provided by the embodiments of the present application and obtains the data of the ten pieces of information with the most total clicks. For the specific implementation, refer to the description of the corresponding process in the virtual resource transfer scenario or to the subsequent embodiments; details are not repeated here.

The data processing method provided by the present application is described in detail below with reference to specific embodiments.

FIG. 2 schematically shows a flowchart of a data processing method provided by an embodiment of the present application. As shown in FIG. 2, the method includes steps 210 to 240, as follows:

Step 210: Perform hash operation processing on each to-be-processed data item in the target scenario using at least one hash algorithm, to obtain at least one hash value corresponding to each to-be-processed data item.

Specifically, the target scenario may be any scenario that requires batch data processing or analysis, such as a virtual resource transfer scenario, a network access scenario, an information recommendation scenario, or a market data analysis scenario. The to-be-processed data in the target scenario is the original record data of the relevant information in that scenario. For example, the to-be-processed data of the virtual resource transfer scenario is the record of each resource transfer, and the to-be-processed data of the network access scenario is the record of the visit information of each website.

A hash algorithm maps data that occupies a large space to data that occupies a smaller space. Data processed by a hash algorithm is more compact, so the storage space it occupies is reduced. Performing hash operation processing on each to-be-processed data item using at least one hash algorithm means that every to-be-processed data item is processed by each of the hash algorithms. One data item yields one hash value per hash algorithm, so after processing by at least one hash algorithm it yields at least one hash value.

In one embodiment of the present application, the to-be-processed data is data in key-value pair form, written as (key, value), where key is the key of the to-be-processed data and value is its value. In general, the value can be looked up by its key. In the embodiments of the present application, performing hash operation processing on the to-be-processed data means applying the hash algorithm to the key of the to-be-processed data to obtain the corresponding hash value.

Step 220: According to the at least one hash value corresponding to the to-be-processed data, superimpose the to-be-processed data onto the data at the position in the preset data list that corresponds to the hash algorithm and the hash value, to obtain the superimposed data corresponding to each hash algorithm.

Specifically, after the hash value of the to-be-processed data is obtained, the to-be-processed data is stored in the preset data list by superimposed storage. Superimposed storage means adding the to-be-processed data to the data already stored at the corresponding position in the preset data list, which gives the new stored data at that position.

The storage position of the to-be-processed data in the preset data list is determined jointly by the hash algorithm and the hash value. The preset data list can be regarded as a data storage table made up of multiple rows and multiple columns, in which the hash algorithm determines one of the row and the column of the storage position and the hash value determines the other. For example, if each row of the preset data list holds the results computed by one hash algorithm, the hash algorithm number gives the row number of the storage position; if each column corresponds to one hash value, the hash value gives the column number of the storage position. The roles may also be swapped, with the hash algorithm number giving the column number and the hash value giving the row number.

In one embodiment of the present application, the to-be-processed data is data in key-value pair form, and superimposed storage adds the value of the to-be-processed data to the data at the corresponding storage position in the preset data list.

In one embodiment of the present application, FIG. 4 schematically shows a preset data list. Suppose there are m hash algorithms, denoted f₁(x), f₂(x), …, f_m(x). The hash algorithm number gives the row number of the storage position, the hash value gives the column number, and the storage position is written as (row number, column number). The total number of columns of the preset data list is set in advance.

As shown in FIG. 4, for the to-be-processed data (key, value), f₁(key) = 4, so one storage position of the data is (1, 4): the value is added to the data at row 1, column 4, that is, the data at (1, 4) + value. f₂(key) = 2, so another storage position is (2, 2): the value is added to the data at row 2, column 2, that is, the data at (2, 2) + value. f_m(key) = 5, so another storage position is (m, 5): the value is added to the data at row m, column 5, that is, the data at (m, 5) + value.
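The FIG. 4 example can be mimicked with a small table. This is a sketch under stated assumptions: m = 3 and b = 6 are illustrative sizes, the hash values 4, 2 and 5 are taken from the example rather than computed, and `store` is a hypothetical helper.

```python
m, b = 3, 6  # assumed: 3 hash algorithms, 6 columns

# Index 0 is left unused so that positions are 1-based, as in the figure.
table = [[0] * (b + 1) for _ in range(m + 1)]

# Hash value of the key under each algorithm: f1 -> 4, f2 -> 2, f3 (= f_m) -> 5.
hash_values = {1: 4, 2: 2, 3: 5}

def store(table, hash_values, value):
    """Superimpose value at (algorithm number, hash value) for every algorithm."""
    for row, col in hash_values.items():
        table[row][col] += value

store(table, hash_values, 10)  # the record (key, 10)
```

After the call, the value 10 appears once in each row, at (1, 4), (2, 2) and (3, 5), and nowhere else.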

It follows that, after being processed by the m hash algorithms, the to-be-processed data (key, value) is stored once in the row of each hash algorithm in the preset data list. The data of each hash algorithm in the preset data list therefore consists of the superimposed data of multiple storage positions, that is, multiple columns of superimposed data.

Step 230: Sort the superimposed data corresponding to each hash algorithm in the preset data list, obtain from the sorting result the target superimposed data that corresponds to each hash algorithm and matches the set indicator, and generate the non-target data corresponding to each hash algorithm from the superimposed data other than the target superimposed data.

Specifically, the sorting is performed over the multiple superimposed data values of each hash algorithm, which may be arranged from largest to smallest or from smallest to largest. The set indicator usually limits the range of data to be selected. For example, the set indicator may be a head (top-ranked) indicator of some data, such as the sum of the visits of the three websites with the most total visits, also referred to as the total visits of the Top 3 websites by visit count.

After sorting, the multiple superimposed data values of each hash algorithm are arranged by size, so the target superimposed data matching the set indicator can be obtained from them conveniently. For example, if the superimposed data values of each hash algorithm are arranged from largest to smallest, taking the top three values from the sorting result gives the target superimposed data.

In the embodiments of the present application, the superimposed data at each storage position in the preset data list is the result of superimposing multiple to-be-processed data items. The target superimposed data therefore contains not only the to-be-processed data that matches the set indicator but also some to-be-processed data that does not match it, and the latter must be discarded during data processing. For example, when the to-be-processed data is website visit data, the target superimposed data of each hash algorithm contains the sum of the visits of the three websites with the most total visits and also the visits of some websites ranked below the top three. Exemplarily, for hash algorithm f₁(x) and two different keys key1 and key2, it may happen that f₁(key1) = f₁(key2), in which case the corresponding value1 and value2 are superimposed at the same position. If key1 is one of the three websites with the most total visits, the to-be-processed data matching the set indicator is value1, whereas the target superimposed data holds the superimposed data of that storage position, namely value1 + value2. value2 must therefore be removed from the target superimposed data to obtain the to-be-processed data value1 that matches the set indicator.
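A toy collision makes the point concrete. Everything in this sketch is assumed for illustration: the hash `f1` (sum of character codes modulo b), the keys "ad" and "bc", and the values 90 and 7 are not from the application.

```python
b = 4  # assumed number of columns

def f1(key):
    """Toy hash for illustration only: sum of character codes modulo b."""
    return sum(ord(c) for c in key) % b

key1, value1 = "ad", 90  # a heavily visited website (matches the set indicator)
key2, value2 = "bc", 7   # a lightly visited website (does not match)

row = [0] * b
for key, value in [(key1, value1), (key2, value2)]:
    row[f1(key)] += value  # both keys hash to the same column

bucket = row[f1(key1)]       # holds value1 + value2, not value1 alone
recovered = bucket - value2  # removing value2 restores value1
```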

The superimposed data of each hash algorithm other than the target superimposed data is data that does not match the set indicator. From this data, the portion of the target superimposed data that does not match the set indicator can be estimated; that estimated portion is the non-target data.

Step 240: According to the target superimposed data and the non-target data, calculate the data among the multiple to-be-processed data items that matches the set indicator, to obtain the data processing result corresponding to the target scenario.

From the foregoing analysis, the non-target data is the data that must be discarded from the target superimposed data. Subtracting the non-target data from the target superimposed data therefore gives the target data, which is the to-be-processed data that matches the set indicator.
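The subtraction can be sketched as follows. Note the assumption: this passage does not state how the non-target data is estimated, so the sketch uses the mean of the remaining buckets as a per-bucket noise estimate, multiplied by the number of target buckets. That estimator, the function name `estimate_top_n` and the sample row are all hypothetical.

```python
def estimate_top_n(row, n):
    """Estimate the top-n total of one hash algorithm's row of buckets.

    Assumed estimator: every target bucket carries the same amount of
    non-matching data as the average non-target bucket.
    """
    ordered = sorted(row, reverse=True)
    target = ordered[:n]  # target superimposed data
    rest = ordered[n:]    # other superimposed data
    noise_per_bucket = sum(rest) / len(rest) if rest else 0.0
    non_target = n * noise_per_bucket  # estimated non-target data
    return sum(target) - non_target

row = [350, 12, 8, 500, 10, 420, 10]
result = estimate_top_n(row, 3)  # (500 + 420 + 350) - 3 * 10
```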

The superimposed data of each hash algorithm is obtained by hashing all to-be-processed data of the target scenario. The target data obtained from any single hash algorithm is therefore already the to-be-processed data of the target scenario that matches the set indicator, that is, the data processing result for the target scenario. When multiple hash algorithms are used, the target data of any one of them may be selected as the data processing result. Optionally, to improve accuracy, the target data obtained from the individual hash algorithms may be processed statistically and the statistic used as the data processing result; for example, the mean of the target data obtained from the hash algorithms may be taken as the data processing result for the target scenario.
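Both options in this paragraph, picking one algorithm's result or averaging all of them, fit in a few lines. The three per-algorithm values are made-up numbers for illustration.

```python
from statistics import mean

# Target data obtained independently from three hash algorithms (assumed values).
per_algorithm_results = [1240.0, 1236.0, 1244.0]

# Option 1: use the result of any single hash algorithm.
single_result = per_algorithm_results[0]

# Option 2: combine the results statistically, here by their mean.
averaged_result = mean(per_algorithm_results)
```

Averaging tends to smooth out the collision noise that is specific to each hash algorithm, which is the accuracy gain the paragraph refers to.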

FIG. 3 schematically shows a flowchart of a data processing method provided by an embodiment of the present application, which further refines the foregoing embodiment. As shown in FIG. 3, the data processing method provided by this embodiment includes steps 310 to 390, as follows:

Step 310: Perform hash operation processing on the key of each to-be-processed data item in the target scenario using at least one hash algorithm, to obtain at least one hash value corresponding to each to-be-processed data item.

Specifically, the to-be-processed data includes data in key-value pair form, that is, each item consists of a key and a value and is written as (key, value).

Exemplarily, the to-be-processed data of the target scenario includes:

(key₁, value₁), (key₂, value₂), …, (key_n, value_n)

Here, the different numbering of the keys indicates that two to-be-processed data items are two different data records; the actual key values may be the same. For example, in the network access scenario, where key is the website identifier and value is the number of visits, it is possible that key₁ = key₂, which means that (key₁, value₁) and (key₂, value₂) are records of the same website. For instance, (key₁, value₁) may record the number of visits to website 1 during one hour and (key₂, value₂) the number of visits to website 1 during the following hour.

In general, processing the to-be-processed data of the target scenario means accumulating the values of all to-be-processed data items that share the same key, which gives an accumulated value as shown in formula (1):

$$s_k = \sum_{i:\, key_i = k} v_i \qquad (1)$$

where v_i denotes value_i and s_k denotes the sum of the values of the to-be-processed data items whose key is k. Formula (1) means that all values whose key equals k are summed.
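The exact accumulation described by formula (1) can be written directly. This is an illustrative sketch: `accumulate_by_key` and the sample records are assumptions, not names from the application.

```python
from collections import defaultdict

def accumulate_by_key(records):
    """Return {k: s_k}, where s_k is the sum of all values whose key is k."""
    sums = defaultdict(int)
    for key, value in records:
        sums[key] += value
    return dict(sums)

# Two records for site1 (for example, two consecutive hours) and one for site2.
records = [("site1", 30), ("site2", 5), ("site1", 20)]
s = accumulate_by_key(records)
```

This exact approach keeps one entry per distinct key, so its memory use grows with the number of keys; that cost is what the hash-based scheme in the following steps is meant to avoid.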

The sum of the N largest accumulated values (abbreviated as the TOPN accumulation) is then computed from the accumulated values, denoted as:

$$\sum_{k \in \mathrm{TOPN}} s_k$$

where TOPN denotes the N top-ranked keys. The TOPN accumulation is the data processing result that is ultimately required.
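For reference, the exact TOPN accumulation over already-accumulated values is a one-liner with the standard library. The helper `top_n_sum` and the sample sums are hypothetical.

```python
import heapq

def top_n_sum(sums, n):
    """Sum of the n largest accumulated values s_k."""
    return sum(heapq.nlargest(n, sums.values()))

sums = {"a": 50, "b": 5, "c": 40, "d": 12}
top3 = top_n_sum(sums, 3)  # 50 + 40 + 12
```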

In one embodiment of the present application, one hash algorithm consists of one hash function operation and one modulo operation. The hash function operations differ between hash algorithms, while the modulo operation is the same. The hash value is computed as follows: the key of each to-be-processed data item in the target scenario is hashed by at least one hash function operation to obtain the hash result of that item, and each hash result is then taken modulo the preset number of hash buckets to obtain the at least one hash value of the item.

Denote the hash functions as h₁(x), h₂(x), …, h_m(x). Taking a result modulo the preset number of hash buckets is written mod b, where b is the preset number of hash buckets; its definition is given in the description of step 320. The hash value is then computed by formula (2):

$$\delta_{i,j} = h_j(key_i) \bmod b \qquad (2)$$

where h_j(key_i) is the hash result obtained by applying the j-th hash function to the key key_i of the i-th to-be-processed data item, and δ_{i,j} is the hash value obtained by taking that result modulo the preset number of hash buckets b. δ_{i,j} is also called a sketch position. It follows that the hash value of a to-be-processed data item is an integer no greater than the preset number of hash buckets.
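Formula (2) can be sketched as below. The concrete hash functions are an assumption: the text does not specify h_j, so SHA-256 salted with j stands in for it, and `sketch_positions` is a hypothetical helper. Python's `%` on non-negative integers returns a value in 0..b-1, so the positions here are zero-based.

```python
import hashlib

def h(j, key):
    """Hypothetical j-th hash function: SHA-256 of the key salted with j."""
    digest = hashlib.sha256(f"{j}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def sketch_positions(key, m, b):
    """Return [delta_1, ..., delta_m], where delta_j = h_j(key) mod b."""
    return [h(j, key) % b for j in range(1, m + 1)]

positions = sketch_positions("1.2.3.4", m=4, b=16)
```

The same key always yields the same m positions, so repeated records of one key accumulate in the same m buckets.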

Because the hash function operations of the hash algorithms differ while the modulo operation is the same, different hash algorithms can also be identified by their hash functions. For example, hash function h₁(x) identifies hash algorithm f₁(x).

Step 320: According to the at least one hash value corresponding to the to-be-processed data, superimpose the value of the to-be-processed data onto the data at the position in the preset data list that corresponds to the hash algorithm and the hash value, to obtain the superimposed data corresponding to each hash algorithm.

Specifically, regarding the preset data list as a data storage table made up of multiple rows and multiple columns, the number of rows is determined by the number of hash algorithms and the number of columns by the preset number of hash buckets. (Alternatively, the number of hash algorithms may determine the number of columns and the preset number of hash buckets the number of rows; this is not elaborated here.) The preset number of hash buckets is the number of hash buckets set in advance. One hash bucket is equivalent to the storage linked list for one fixed hash value, and what it stores is the superimposed data produced by the hash algorithms. Because the storage space occupied by one hash bucket is fixed, the preset number of hash buckets in effect reflects the amount of storage space occupied by the preset data list.

Exemplarily, FIG. 4 schematically shows a preset data list. In this structure, the preset number of hash buckets is b, so the preset data list has b columns; there are m hash algorithms, denoted f₁(x), f₂(x), …, f_m(x), so the preset data list has m rows.

In one embodiment of the present application, the accumulated data is computed by formula (3):

$$hs_{j,l} = \sum_{i:\, \delta_{i,j} = l} v_i \qquad (3)$$

where hs_{j,l} denotes the accumulated data corresponding to the j-th hash algorithm and hash value l, which is the superimposed data at row j, column l of the preset data list. Formula (3) means that the values of all to-be-processed data items for which the j-th hash algorithm yields hash value l are summed.
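A compact sketch of building the whole m × b table according to formula (3) follows. The assumptions are the same as for the hashing step: salted SHA-256 stands in for the unspecified hash functions, and `build_sketch`, the sample records and the sizes m = 2, b = 8 are illustrative. Rows and columns are zero-based here.

```python
import hashlib

def h(j, key):
    """Hypothetical j-th hash function: SHA-256 of the key salted with j."""
    digest = hashlib.sha256(f"{j}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def build_sketch(records, m, b):
    """Return hs, where hs[j][l] sums the values whose key hashes to l under h_j."""
    hs = [[0] * b for _ in range(m)]
    for key, value in records:
        for j in range(m):
            hs[j][h(j, key) % b] += value
    return hs

records = [("1.2.3.4", 50), ("5.6.7.8", 7), ("1.2.3.4", 25), ("9.9.9.9", 3)]
hs = build_sketch(records, m=2, b=8)
```

Each record is added exactly once per row, so every row sums to the total of all values (85 here), and the table size stays m × b no matter how many records arrive.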

For example, for the data to be processed (key, value) shown in FIG. 4: f₁(key) = 4, so value is superimposed at position (1, 4); f₂(key) = 2, so value is superimposed at position (2, 2); f_m(key) = 5, so value is superimposed at position (m, 5).
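The superposition described above can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the helper names `bucket_of` and `build_table`, the use of a salted SHA-256 digest as the j-th hash function, and the 0-based row and column indices are all assumptions made for the example.

```python
import hashlib


def bucket_of(key: str, j: int, b: int) -> int:
    """Hypothetical j-th hash algorithm: a salted hash function followed by modulo b."""
    digest = hashlib.sha256(f"{j}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % b


def build_table(records, m: int, b: int):
    """Superimpose each record's value into one cell per hash algorithm.

    The table has m rows (one per hash algorithm) and b columns (one per
    hash bucket). Indices are 0-based here, whereas the text counts from 1.
    """
    table = [[0] * b for _ in range(m)]
    for key, value in records:
        for j in range(m):
            table[j][bucket_of(key, j, b)] += value
    return table
```

Every row receives every value exactly once, so each row sums to the total of all values regardless of how keys collide; only the m × b cells are kept, not the individual records.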

This storage scheme does not store the data to be processed record by record; it stores only the intermediate results (the superimposed data). A large amount of head index data can therefore be calculated from a small number of intermediate results, that is, with limited storage resources.

Step 330: Sort the superimposed data corresponding to each hash algorithm in the preset data list and, according to the sorting result, obtain the superimposed data stored in a set number of hash buckets that correspond to each hash algorithm and match the set index.

Specifically, this step extracts the superimposed data in the TOPN hash buckets from the superimposed data corresponding to each hash algorithm. By way of example, in the data list shown in FIG. 4, every cell in a row is one hash bucket of the corresponding hash algorithm. The superimposed data in the TOPN hash buckets is the superimposed data of the first N hash buckets in the sorted order.

By way of example, FIG. 5 schematically shows the sorting result of the multiple pieces of superimposed data corresponding to one hash algorithm. As shown in FIG. 5, each bar represents the superimposed data of one hash bucket. Assuming TOPN is TOP3, the superimposed data of the three hash buckets matching the set index is the data represented by the three bars inside the dashed box in FIG. 5.

Step 340: Sum the superimposed data stored in the set number of hash buckets to obtain the target superimposed data corresponding to each hash algorithm.

Specifically, the superimposed data in the extracted TOPN hash buckets is summed to obtain the target superimposed data, which can be expressed as:

where hs_l denotes the superimposed data in the l-th hash bucket.

By way of example, in the sorting result shown in FIG. 5, the target superimposed data corresponding to TOP3 is the sum of the data of the three bars inside the dashed box.
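Steps 330 and 340 can be sketched for a single hash algorithm (one row of the preset data list) as follows. The function names and the sample row are made up for illustration; ties between equal buckets are broken by bucket position, which the text does not specify.

```python
def top_n_buckets(row, n):
    """Return the indices of the n hash buckets holding the largest superimposed data."""
    order = sorted(range(len(row)), key=lambda l: row[l], reverse=True)
    return order[:n]


def target_superimposed(row, n):
    """Sum the superimposed data of the top-n buckets (step 340).

    Returns the sum together with the chosen bucket indices so that later
    steps can tell target buckets from the other buckets.
    """
    top = top_n_buckets(row, n)
    return sum(row[l] for l in top), top
```

For a row such as `[4, 10, 1, 7, 3, 9]` with N = 3, the selected buckets are those holding 10, 9 and 7, and their sum is the target superimposed data for that hash algorithm.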

In the bar chart of FIG. 5, each bar consists of at least one square, and each square represents the sum of all values corresponding to one key. In a website-visit scenario, for example, a square represents the total number of visits to one website. Different squares correspond to different keys. In FIG. 5, the sum of the data of the three largest-area squares inside the dashed box (the shaded part of FIG. 5) is the quantity ultimately required, namely the total visit count of the TOP3 most-visited websites, that is, the TOPN accumulation. The data of the remaining squares inside the dashed box is the non-target data.

Step 350: Generate the numerical expectation of the non-target data from the superimposed data, other than the target superimposed data, among the superimposed data corresponding to the hash algorithm, and from the number of pieces of data to be processed corresponding to that other superimposed data.

Specifically, among the superimposed data corresponding to the hash algorithm, the data other than the target superimposed data is the other superimposed data, such as the data represented by the bars outside the dashed box in FIG. 5. The present application uses the other superimposed data and the corresponding amount of data to be processed to calculate the numerical expectation of the non-target data, that is, the expectation of value over the non-target data.

The calculation of the non-target data is described below using a single hash algorithm as an example. Calculating the numerical expectation of the non-target data includes: summing the superimposed data, other than the target superimposed data, among the superimposed data corresponding to the hash algorithm, to obtain the non-target superimposed data corresponding to the hash algorithm; determining, from the amount of data to be processed corresponding to the other superimposed data, the deduplicated count of the data to be processed corresponding to the other superimposed data; and obtaining the numerical expectation of the non-target data as the ratio of the non-target superimposed data to that deduplicated count.

Specifically, the sum of the other superimposed data is taken as the non-target superimposed data, which can be expressed as:

where hs_l denotes the superimposed data stored in the l-th hash bucket. When the l-th hash bucket is not one of the TOPN hash buckets, its superimposed data is other superimposed data; summing the superimposed data of all hash buckets that are not TOPN hash buckets gives the non-target superimposed data.

The deduplicated count of the data to be processed is the number of pieces of data to be processed that remain after deduplication by key, that is, the number of distinct keys. The deduplicated count corresponding to the other superimposed data is the number of keys obtained by deduplicating the keys across all data to be processed that corresponds to the other superimposed data. For example, if 100 pieces of data to be processed carry 100 keys and 50 keys remain after deduplication, the deduplicated count is 50.

In one embodiment of the present application, the deduplicated count of the data to be processed corresponding to the other superimposed data is calculated as follows: deduplicate the multiple pieces of data to be processed in the target scene by key to obtain the total deduplicated count of the data to be processed; deduplicate the multiple pieces of data to be processed corresponding to the target superimposed data by key to obtain the deduplicated count corresponding to the target superimposed data; and take the difference between the total deduplicated count and the deduplicated count corresponding to the target superimposed data as the deduplicated count corresponding to the other superimposed data.

In other words, the deduplicated count for the target superimposed data, the counterpart of the other superimposed data, is obtained first; subtracting it from the total deduplicated count of the data to be processed in the target scene then gives the deduplicated count corresponding to the other superimposed data. The reason is that the deduplicated count corresponding to the other superimposed data is usually larger than that corresponding to the target superimposed data (the target superimposed data is the TOPN data, whereas the other superimposed data is everything outside the TOPN). Deriving the count for the other superimposed data from the count for the target superimposed data therefore processes less data than deduplicating the data corresponding to the other superimposed data directly, which speeds up data processing.

By way of example, the numerical expectation of the non-target data is calculated as shown in formula (4):

where K denotes the total deduplicated count of the data to be processed in the target scene; the deduplicated count of the data to be processed corresponding to the target superimposed data is the number of distinct keys in the TOPN hash buckets; and the deduplicated count of the data to be processed corresponding to the other superimposed data is the number of distinct keys in the hash buckets outside the TOPN.
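The ratio described in step 350 can be sketched as below. The function and parameter names are assumptions, and the two deduplicated counts are taken as inputs because the text does not say how distinct keys per bucket are tracked.

```python
def non_target_value_expectation(row, top_indices, total_distinct, top_distinct):
    """Numerical expectation of the non-target data for one hash algorithm.

    row            -- superimposed data of every hash bucket for this algorithm
    top_indices    -- indices of the TOPN hash buckets
    total_distinct -- K, the total number of distinct keys in the target scene
    top_distinct   -- number of distinct keys that fall into the TOPN buckets
    """
    top = set(top_indices)
    # Non-target superimposed data: sum over the buckets outside the TOPN.
    non_target_sum = sum(v for l, v in enumerate(row) if l not in top)
    # Distinct keys outside the TOPN buckets, obtained by subtraction.
    other_distinct = total_distinct - top_distinct
    if other_distinct <= 0:
        raise ValueError("no distinct keys fall outside the TOPN hash buckets")
    return non_target_sum / other_distinct
```

With the row `[4, 10, 1, 7, 3, 9]`, TOPN buckets `[1, 5, 3]`, 20 distinct keys overall and 12 of them in the TOPN buckets, the buckets outside the TOPN sum to 8 over 8 distinct keys, giving an expectation of 1.0 per key.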

Step 360: Calculate the quantity expectation of the non-target data according to the amount of data to be processed corresponding to the target superimposed data.

Specifically, the quantity expectation of the non-target data is the expectation of the amount of data to be processed corresponding to the non-target data. It can be obtained as the difference between the quantity expectation of the target superimposed data and the value of the set index; that is, the quantity expectation of the TOPN hash buckets minus N gives the quantity expectation of the non-target data, as shown in formula (5):

where the first term denotes the quantity expectation of the target superimposed data, and N denotes the specific value of the set index.

In one embodiment of the present application, the quantity expectation of the target superimposed data is generated from the preset number of hash buckets, the total deduplicated count of the data to be processed corresponding to the target scene, and a preset fitting function. By way of example, with K denoting the total deduplicated count of the data to be processed corresponding to the target scene and b denoting the preset number of hash buckets, the preset fitting function can be expressed as a function of K and b. Substituting the total deduplicated count and the preset number of hash buckets of this embodiment into the preset fitting function gives the quantity expectation of the target superimposed data.

In one embodiment of the present application, the preset fitting function is generated as follows: construct a fitting function of the number of hash buckets and the total deduplicated count from the preset number of hash buckets, the total deduplicated count of the data to be processed corresponding to the target scene, and undetermined fitting coefficients; train the fitting function on sample data to obtain the values of the undetermined fitting coefficients; and generate the preset fitting function from the preset number of hash buckets, the total deduplicated count of the data to be processed corresponding to the target scene, and the values of the fitting coefficients.

By way of example, the preset fitting function is expressed as shown in formula (6):

where K denotes the total deduplicated count of the data to be processed corresponding to the target scene; b denotes the number of hash buckets; and a₁, a₂, a₃ are the undetermined fitting coefficients, which can be obtained by fitting the fitting function to sample data.

In one embodiment of the present application, fitting the fitting function to the sample data includes: randomly generating initial values for the undetermined fitting coefficients; evaluating the fitting function with those initial coefficient values on the sample data to obtain the predicted quantity expectation of the sample data; and adjusting the coefficient values according to the difference between the predicted quantity expectation and the actual quantity expectation of the sample data until the difference is smaller than a preset threshold, which gives the values of the undetermined fitting coefficients.

First, initial values of the undetermined fitting coefficients a₁, a₂, a₃ are generated, and the sample data is substituted into the fitting function with those initial coefficients to calculate the predicted quantity expectation for the sample data. The sample data includes a sample number of hash buckets and a sample total deduplicated count for the target scene; that is, the initial coefficient values, the sample number of hash buckets, and the sample total deduplicated count are substituted into formula (6) to obtain the predicted quantity expectation for the sample data. During construction of the preset fitting function, the actual quantity expectation corresponding to the sample data is known. The coefficient values are then adjusted according to the difference between the predicted and actual quantity expectations, the fitting function with the adjusted coefficients is evaluated on the sample data again to obtain a new predicted quantity expectation, the difference is recalculated, and the coefficients are adjusted again. This loop repeats until the difference is smaller than the preset threshold, which gives the target values of the undetermined fitting coefficients. The preset fitting function is obtained from these target values.
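The adjust-until-below-threshold loop can be sketched as follows. Formula (6) is not reproduced in this text, so the functional form used here, a₁·K + a₂·b + a₃, is purely a hypothetical stand-in, as are the gradient-descent update rule, the learning rate, the function names, and the synthetic samples. The inputs are assumed to be scaled to comparable ranges so that plain gradient descent converges.

```python
import random


def predict(coeffs, K, b):
    """Hypothetical fitting function (NOT formula (6)): a1*K + a2*b + a3."""
    a1, a2, a3 = coeffs
    return a1 * K + a2 * b + a3


def fit_coefficients(samples, threshold=1e-3, lr=0.1, max_iter=20000, seed=0):
    """Adjust a1, a2, a3 until every prediction is within `threshold` of its target.

    samples -- list of (K, b, actual_quantity_expectation) with scaled K and b
    """
    rng = random.Random(seed)
    coeffs = [rng.uniform(-1.0, 1.0) for _ in range(3)]  # random initial values
    n = len(samples)
    for _ in range(max_iter):
        diffs = [predict(coeffs, K, b) - y for K, b, y in samples]
        if max(abs(d) for d in diffs) < threshold:
            break  # predicted and actual expectations are close enough
        # Gradient of the mean squared difference with respect to a1, a2, a3.
        grad = [
            2 * sum(d * K for d, (K, _b, _y) in zip(diffs, samples)) / n,
            2 * sum(d * b for d, (_K, b, _y) in zip(diffs, samples)) / n,
            2 * sum(diffs) / n,
        ]
        coeffs = [c - lr * g for c, g in zip(coeffs, grad)]
    return coeffs


def make_synthetic_samples():
    """Made-up samples generated from known coefficients (2.0, -1.0, 0.5)."""
    return [
        (K / 1000, b / 100, 2.0 * K / 1000 - 1.0 * b / 100 + 0.5)
        for K in (200, 400, 600, 800, 1000)
        for b in (20, 40, 60, 80, 100)
    ]
```

Calling `fit_coefficients(make_synthetic_samples())` returns coefficients whose predictions stay within the threshold on every synthetic sample. With real data, the actual quantity expectations would come from measured samples and the fitting function would follow formula (6).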

Substituting the target values of the fitting coefficients, the preset number of hash buckets, and the total deduplicated count of the data to be processed corresponding to the target scene into formula (6) gives the quantity expectation of the target superimposed data in this embodiment.

By way of example, FIG. 6 schematically shows a plot of a preset fitting function constructed by linear fitting. The difference between the fitted values of the preset fitting function and the actual values is very small, so the fitted value can be used as the quantity expectation of the target superimposed data.

Step 370: Generate the non-target data corresponding to the hash algorithm as the product of the numerical expectation of the non-target data and the quantity expectation of the non-target data.

Specifically, the numerical expectation is multiplied by the quantity expectation to obtain the non-target data, as shown in formula (7):

The non-target data is the sum, within the target superimposed data, of the data to be processed that does not match the set index; that is, within the target superimposed data of the TOPN hash buckets, it is the sum of the data to be processed that does not belong to the TOPN.

Step 380: From the difference between the target superimposed data and the non-target data corresponding to each hash algorithm, obtain the data matching the set index among the multiple pieces of data to be processed for that hash algorithm.

Specifically, the non-target data is subtracted from the target superimposed data to obtain the target data matching the set index, which is the sum of the TOPN data to be processed, that is, the TOPN accumulation.

Step 390: Perform statistical processing on the data matching the set index obtained for each hash algorithm to obtain the data processing result corresponding to the target scene.

Specifically, the expectation of the TOPN accumulations of the individual hash algorithms is taken to obtain the TOPN accumulation over the data to be processed in the target scene, as shown in formula (8):

where H denotes the full collection of hash functions, that is, all of the hash algorithms. E_{h∈H} denotes taking the expectation over the results of all hash algorithms, and the expression in parentheses after E_{h∈H} is the quantity calculated for a single hash algorithm.
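Steps 360 to 390 can be combined into a short sketch. The function names and the tuple layout are assumptions, and the expectation over hash algorithms is interpreted here as a plain arithmetic mean, which the text does not state explicitly.

```python
from statistics import mean


def top_n_sum_for_algorithm(target_sum, value_expectation, expected_count_in_top, n):
    """TOPN accumulation estimated from one hash algorithm.

    target_sum            -- sum of the TOPN hash buckets (step 340)
    value_expectation     -- numerical expectation of the non-target data (step 350)
    expected_count_in_top -- quantity expectation of the target superimposed data
    n                     -- N, the value of the set index
    """
    non_target_count = expected_count_in_top - n      # step 360
    non_target = value_expectation * non_target_count  # step 370
    return target_sum - non_target                     # step 380


def top_n_sum(per_algorithm, n):
    """Average the per-algorithm estimates (step 390, assumed to be a mean)."""
    return mean(top_n_sum_for_algorithm(*row, n) for row in per_algorithm)
```

For instance, with N = 3 and two hash algorithms described by `(26, 1.0, 5.0)` and `(28, 1.5, 5.0)`, the per-algorithm estimates are 24.0 and 25.0, and their mean is 24.5.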

The technical solution provided by the embodiments of the present application superimposes and stores the data to be processed in a preset number of hash buckets. Because the storage space occupied by the hash buckets is fixed, this storage scheme does not store the data to be processed record by record but stores only the superimposed data obtained from intermediate calculation, so a large amount of head index data can be calculated with limited storage resources. Moreover, no data to be processed is discarded while the superimposed data is stored, which amounts to retaining the global information of the data to be processed. Data whose records are scattered but whose superimposed total is large can thus be taken into account, making the data used for index calculation more comprehensive, so calculation accuracy does not drop even though less storage is used.

It should be noted that although the steps of the method of the present application are depicted in the drawings in a particular order, this neither requires nor implies that the steps must be performed in that order, or that all of the illustrated steps must be performed, to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one, and/or one step may be split into several.

The apparatus embodiments of the present application, which can be used to perform the data processing method of the above embodiments, are described below. FIG. 7 schematically shows a structural block diagram of the data processing apparatus provided by an embodiment of the present application. As shown in FIG. 7, the data processing apparatus includes:

a hash operation module 710, configured to perform hash operation processing on each piece of data to be processed in the target scene through at least one hash algorithm, to obtain at least one hash value corresponding to each piece of data to be processed;

a data superposition module 720, configured to superimpose, according to the at least one hash value corresponding to the data to be processed, the data to be processed onto the data at the position in the preset data list corresponding to the hash algorithm and the hash value, to obtain the superimposed data corresponding to each hash algorithm;

a data calculation module 730, configured to sort the superimposed data corresponding to each hash algorithm in the preset data list, obtain according to the sorting result the target superimposed data that corresponds to each hash algorithm and matches the set index, and generate the non-target data corresponding to each hash algorithm from the superimposed data, other than the target superimposed data, among the superimposed data corresponding to that hash algorithm;

a result generation module 740, configured to calculate, from the target superimposed data and the non-target data, the data matching the set index among the multiple pieces of data to be processed, to obtain the data processing result corresponding to the target scene.

In one embodiment of the present application, the data to be processed includes data in key-value pair form; the hash operation module 710 is specifically configured to perform hash operation processing, through at least one hash algorithm, on the key of each piece of data to be processed in the target scene;

the data superposition module 720 is specifically configured to superimpose, according to the at least one hash value corresponding to the data to be processed, the value of the data to be processed onto the data at the position in the preset data list corresponding to the hash algorithm and the hash value.

In one embodiment of the present application, the hash algorithm includes a hash function operation and a modulo operation; the hash operation module 710 includes:

In one embodiment of the present application, the superimposed data corresponding to the hash algorithm includes the superimposed data stored in multiple hash buckets, where one hash bucket represents one storage location in the preset data list; the data calculation module 730 includes:

In one embodiment of the present application, the data calculation module 730 includes:

In one embodiment of the present application, the result generation module 740 includes:

The specific details of the data processing apparatus provided in the embodiments of the present application have been described in detail in the corresponding method embodiments and are not repeated here.

FIG. 8 schematically shows a structural block diagram of the computer system of an electronic device used to implement an embodiment of the present application.

It should be noted that the computer system 800 of the electronic device shown in FIG. 8 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.

As shown in FIG. 8, the computer system 800 includes a central processing unit 801 (Central Processing Unit, CPU), which can perform various appropriate actions and processes according to a program stored in a read-only memory 802 (Read-Only Memory, ROM) or a program loaded from a storage section 808 into a random access memory 803 (Random Access Memory, RAM). The random access memory 803 also stores the various programs and data required for system operation. The central processing unit 801, the read-only memory 802, and the random access memory 803 are connected to one another through a bus 804. An input/output interface 805 (I/O interface) is also connected to the bus 804.

The following components are connected to the input/output interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a cathode ray tube (Cathode Ray Tube, CRT) or liquid crystal display (Liquid Crystal Display, LCD), a speaker, and the like; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card or a modem. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the input/output interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read from it can be installed into the storage section 808 as needed.

In particular, according to the embodiments of the present application, the processes described in the method flowcharts may be implemented as computer software programs. For example, the embodiments of the present application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. When the computer program is executed by the central processing unit 801, the various functions defined in the system of the present application are performed.

It should be noted that the computer-readable medium described in the embodiments of the present application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but without limitation, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of these. More specific examples of computer-readable storage media include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (Erasable Programmable Read Only Memory, EPROM), flash memory, optical fiber, portable compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of these. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in connection with an instruction execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave and carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of these. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted over any suitable medium, including but not limited to wireless or wired media, or any suitable combination of these.

The flowcharts and block diagrams in the drawings illustrate the possible architecture, functions, and operation of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should further be noted that each block in the block diagrams or flowcharts, and each combination of such blocks, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

It should be noted that although several modules or units of the device for performing actions are mentioned in the detailed description above, this division is not mandatory. Indeed, according to embodiments of the present application, the features and functions of two or more modules or units described above may be embodied in a single module or unit. Conversely, the features and functions of one module or unit described above may be further divided among multiple modules or units.

From the description of the above embodiments, those skilled in the art will readily understand that the example embodiments described herein may be implemented in software, or in software combined with the necessary hardware. Therefore, the technical solutions according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a portable hard disk) or on a network, and which includes several instructions that cause a computing device (such as a personal computer, a server, a touch terminal, or a network device) to execute the method according to the embodiments of the present application.

Other embodiments of the present application will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. The present application is intended to cover any variations, uses, or adaptations of the present application that follow its general principles and that include common knowledge or customary technical means in the art not disclosed in the present application.

It is to be understood that the present application is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present application is limited only by the appended claims.

Claims

1. A data processing method, comprising:

performing a hash operation on each piece of data to be processed in a target scene using at least one hash algorithm, to obtain at least one hash value corresponding to each piece of data to be processed;

according to the at least one hash value corresponding to the data to be processed, superimposing the data to be processed onto the data at the position in a preset data list that corresponds to the hash algorithm and the hash value, to obtain superimposed data corresponding to each hash algorithm;

sorting the superimposed data corresponding to each hash algorithm in the preset data list, acquiring, according to the sorting result, target superimposed data that corresponds to each hash algorithm and matches a set index, and generating non-target data corresponding to each hash algorithm from the superimposed data, other than the target superimposed data, among the superimposed data corresponding to that hash algorithm;

and calculating, according to the target superimposed data and the non-target data, the data matching the set index among the plurality of pieces of data to be processed, so as to obtain a data processing result corresponding to the target scene.

2. The data processing method of claim 1, wherein the data to be processed comprises data in the form of key-value pairs;

performing the hash operation on each piece of data to be processed in the target scene using at least one hash algorithm comprises: performing the hash operation on the key of each piece of data to be processed in the target scene using the at least one hash algorithm;

superimposing, according to the at least one hash value corresponding to the data to be processed, the data to be processed onto the data at the position in the preset data list that corresponds to the hash algorithm and the hash value comprises: superimposing, according to the at least one hash value corresponding to the data to be processed, the value of the data to be processed onto the data at the position in the preset data list that corresponds to the hash algorithm and the hash value.

3. The data processing method of claim 2, wherein the hash algorithm comprises a hash function operation and a modulo operation; performing the hash operation on the key of each piece of data to be processed in the target scene using at least one hash algorithm, to obtain at least one hash value corresponding to each piece of data to be processed, comprises:

performing a hash calculation on the key of each piece of data to be processed in the target scene through at least one hash function operation, to obtain at least one hash result for each piece of data to be processed;

performing a modulo operation on the at least one hash result of each piece of data to be processed with respect to a preset number of hash buckets, and taking the result of the modulo operation as the at least one hash value corresponding to each piece of data to be processed, wherein the preset number of hash buckets indicates the size of the storage space occupied by the preset data list.
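For illustration only, the hashing and superposition steps of claims 2 and 3 might be sketched as follows. The function names, the use of a seeded SHA-256 digest to stand in for "at least one hash function operation", and the representation of the preset data list as one row of buckets per hash algorithm are all assumptions made for this sketch; the claims do not fix any of them.

```python
import hashlib


def bucket_index(key: str, seed: int, num_buckets: int) -> int:
    """Hash function operation followed by a modulo operation (claim 3).

    Assumption: a seeded SHA-256 digest plays the role of one hash function;
    a different seed gives a different hash algorithm.
    """
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    hash_result = int.from_bytes(digest[:8], "big")
    return hash_result % num_buckets


def build_sketch(records, num_hashes: int, num_buckets: int):
    """Superimpose each value onto the bucket selected by each hash algorithm.

    `records` is an iterable of (key, value) pairs (claim 2). The returned
    table is a hypothetical layout of the preset data list: one row per hash
    algorithm, one column per hash bucket.
    """
    table = [[0.0] * num_buckets for _ in range(num_hashes)]
    for key, value in records:
        for seed in range(num_hashes):
            table[seed][bucket_index(key, seed, num_buckets)] += value
    return table
```

Under these assumptions the memory used depends only on `num_hashes * num_buckets`, not on the number of distinct keys, which is consistent with the storage saving the abstract describes.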

4. The data processing method according to claim 1, wherein the superimposed data corresponding to the hash algorithm comprises superimposed data stored in a plurality of hash buckets, one hash bucket representing one storage location in the preset data list; acquiring, according to the sorting result, the target superimposed data that corresponds to each hash algorithm and matches the set index comprises:

acquiring, according to the sorting result, the superimposed data stored in a set number of hash buckets that correspond to each hash algorithm and match the set index;

and summing the superimposed data stored in the set number of hash buckets to obtain the target superimposed data corresponding to each hash algorithm.
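A minimal sketch of the selection step in claim 4, assuming the set index is a "top K" style index so that the set number of hash buckets is the K buckets with the largest superimposed values. The function name and the descending sort order are assumptions; the claim only requires that the buckets be chosen according to the sorting result.

```python
def target_superimposed(row, k: int):
    """Return (target superimposed data, remaining bucket values) for one hash algorithm.

    `row` holds the superimposed data of every hash bucket for a single hash
    algorithm. Assumption: "matching the set index" means the k largest buckets.
    """
    ranked = sorted(row, reverse=True)
    return sum(ranked[:k]), ranked[k:]
```

For example, `target_superimposed([5.0, 1.0, 9.0, 3.0], 2)` sums the two largest buckets (9.0 and 5.0) and leaves the other two for the non-target computation.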

5. The data processing method according to claim 1, wherein generating non-target data corresponding to each hash algorithm from the superimposed data, other than the target superimposed data, among the superimposed data corresponding to that hash algorithm comprises:

generating a numerical expectation of the non-target data according to the superimposed data other than the target superimposed data among the superimposed data corresponding to the hash algorithm, and according to the amount of data to be processed corresponding to the other superimposed data;

calculating a quantity expectation of the non-target data according to the amount of data to be processed corresponding to the target superimposed data;

and generating the non-target data corresponding to the hash algorithm according to the product of the numerical expectation of the non-target data and the quantity expectation of the non-target data.

6. The data processing method according to claim 5, wherein generating the numerical expectation of the non-target data according to the superimposed data other than the target superimposed data among the superimposed data corresponding to the hash algorithm, and according to the amount of data to be processed corresponding to the other superimposed data, comprises:

summing the superimposed data other than the target superimposed data among the superimposed data corresponding to the hash algorithm, to obtain non-target superimposed data corresponding to the hash algorithm;

determining a deduplicated count of the data to be processed corresponding to the other superimposed data according to the amount of data to be processed corresponding to the other superimposed data;

and obtaining the numerical expectation of the non-target data according to the ratio of the non-target superimposed data to the deduplicated count of the data to be processed corresponding to the other superimposed data.

7. The data processing method of claim 6, wherein the data to be processed comprises data in the form of key-value pairs; determining the deduplicated count of the data to be processed corresponding to the other superimposed data according to the amount of data to be processed corresponding to the other superimposed data comprises:

deduplicating, by key, the plurality of pieces of data to be processed in the target scene, to obtain a total deduplicated count of the data to be processed;

deduplicating, by key, the plurality of pieces of data to be processed corresponding to the target superimposed data, to obtain a deduplicated count of the data to be processed corresponding to the target superimposed data;

and obtaining the deduplicated count of the data to be processed corresponding to the other superimposed data according to the difference between the total deduplicated count of the data to be processed and the deduplicated count of the data to be processed corresponding to the target superimposed data.
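The arithmetic of claims 5 to 7 can be sketched as below. The function and parameter names are hypothetical, and the deduplicated counts are taken as inputs because the claims do not say how they are computed (exact key sets or an approximate distinct counter would both fit). The guard against a non-positive denominator is also an assumption of this sketch.

```python
def non_target_value_expectation(other_buckets, dedup_total: int, dedup_target: int) -> float:
    """Numerical expectation of the non-target data (claims 6 and 7).

    other_buckets: superimposed data of the buckets not selected as target.
    dedup_total:   distinct keys in the whole target scene.
    dedup_target:  distinct keys that fell into the target buckets.
    """
    non_target_sum = sum(other_buckets)        # non-target superimposed data
    dedup_other = dedup_total - dedup_target   # claim 7: difference of deduplicated counts
    if dedup_other <= 0:                       # assumption: avoid division by zero
        return 0.0
    return non_target_sum / dedup_other        # claim 6: ratio


def non_target_data(value_expectation: float, quantity_expectation: float) -> float:
    """Claim 5: product of the numerical expectation and the quantity expectation."""
    return value_expectation * quantity_expectation
```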

8. The data processing method of claim 5, wherein calculating the quantity expectation of the non-target data according to the amount of data to be processed corresponding to the target superimposed data comprises:

generating a quantity expectation of the target superimposed data according to a preset number of hash buckets, the total deduplicated count of the data to be processed corresponding to the target scene, and a preset fitting function;

and obtaining the quantity expectation of the non-target data according to the difference between the quantity expectation of the target superimposed data and the value of the set index.
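A hedged sketch of claim 8, in which the preset fitting function is passed in as a callable because its concrete form is not given in the claims. The names `non_target_quantity_expectation` and `fitting_fn` are hypothetical.

```python
from typing import Callable


def non_target_quantity_expectation(
    num_buckets: int,
    dedup_total: int,
    set_index_value: float,
    fitting_fn: Callable[[int, int], float],
) -> float:
    """Quantity expectation of the non-target data (claim 8).

    fitting_fn(num_buckets, dedup_total) is assumed to return the expected
    number of pieces of data that land in the target buckets; subtracting the
    value of the set index leaves the expected number of non-target pieces.
    """
    expected_in_target = fitting_fn(num_buckets, dedup_total)
    return expected_in_target - set_index_value
```

With a purely illustrative fitting function such as `lambda m, n: 10 + 0.5 * n / m`, 100 buckets, 1000 distinct keys, and a set index value of 10, the function evaluates 15.0 for the target buckets and subtracts 10.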

9. The data processing method according to claim 8, wherein before generating the quantity expectation of the target superimposed data according to the preset number of hash buckets, the total deduplicated count of the data to be processed corresponding to the target scene, and the preset fitting function, the method further comprises:

constructing, according to an undetermined fitting coefficient, a fitting function of the number of hash buckets and the total deduplicated count of the data to be processed;

training the fitting function with sample data to obtain a target value of the undetermined fitting coefficient, wherein the sample data comprises a sample number of hash buckets and a sample total deduplicated count of data to be processed corresponding to the target scene;

and generating the preset fitting function according to the target value of the undetermined fitting coefficient.

10. The data processing method of claim 9, wherein training the fitting function with sample data to obtain the target value of the undetermined fitting coefficient comprises:

randomly generating an initial value of the undetermined fitting coefficient;

applying the fitting function, with the undetermined fitting coefficient set to the initial value, to the sample data to obtain a predicted quantity expectation of the sample data;

and adjusting the value of the undetermined fitting coefficient according to the difference between the predicted quantity expectation of the sample data and the actual quantity expectation of the sample data, until the difference is smaller than a preset threshold, so as to obtain the target value of the undetermined fitting coefficient.
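The training loop of claims 9 and 10 might look like the sketch below. The single-coefficient form `coef * dedup_total / num_buckets` is a hypothetical fitting function chosen only to keep the example small, and the gradient-style update is one possible way to "adjust" the coefficient; neither is stated in the claims.

```python
import random


def fit_fn(coef: float, num_buckets: int, dedup_total: int) -> float:
    """Hypothetical fitting function with one undetermined coefficient."""
    return coef * dedup_total / num_buckets


def train_coefficient(samples, threshold: float = 1e-6, max_iters: int = 10_000, seed: int = 0) -> float:
    """Adjust the coefficient until prediction and actual value differ by less than threshold.

    samples: list of (sample_num_buckets, sample_dedup_total, actual_quantity_expectation).
    """
    rng = random.Random(seed)
    coef = rng.uniform(0.0, 1.0)                # randomly generated initial value
    xs = [n / m for m, n, _ in samples]
    # Step size scaled by the mean squared feature so the update is stable.
    step = 0.5 / (sum(x * x for x in xs) / len(xs))
    for _ in range(max_iters):
        errors = [fit_fn(coef, m, n) - y for m, n, y in samples]
        if max(abs(e) for e in errors) < threshold:
            break                               # difference below the preset threshold
        grad = sum(e * x for e, x in zip(errors, xs)) / len(xs)
        coef -= step * grad                     # adjust the coefficient
    return coef
```

The trained coefficient can then be bound into the preset fitting function, for example `lambda m, n: fit_fn(coef, m, n)`.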

11. The data processing method according to any one of claims 1 to 10, wherein calculating, according to the target superimposed data and the non-target data, the data matching the set index among the plurality of pieces of data to be processed, so as to obtain the data processing result corresponding to the target scene, comprises:

obtaining, for each hash algorithm, the data matching the set index among the plurality of pieces of data to be processed according to the difference between the target superimposed data and the non-target data corresponding to that hash algorithm;

and performing statistical processing on the data matching the set index obtained for the respective hash algorithms, to obtain the data processing result corresponding to the target scene.

12. The data processing method according to claim 11, wherein performing statistical processing on the data matching the set index obtained for the respective hash algorithms, to obtain the data processing result corresponding to the target scene, comprises:

calculating the expected value of the data matching the set index obtained for the respective hash algorithms, and taking the expected value as the data processing result corresponding to the target scene.
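Claims 11 and 12 reduce to a subtraction per hash algorithm followed by an average. The sketch below assumes that "expected value" is estimated by the arithmetic mean over the hash algorithms; the function name is hypothetical.

```python
from statistics import mean


def combine_estimates(per_hash):
    """Combine per-hash-algorithm results into one data processing result.

    per_hash: list of (target_superimposed_data, non_target_data) pairs,
    one pair per hash algorithm.
    """
    # Claim 11: data matching the set index = target minus non-target.
    estimates = [target - non_target for target, non_target in per_hash]
    # Claim 12: expected value across hash algorithms (assumed: arithmetic mean).
    return mean(estimates)
```

Averaging over several independent hash algorithms is a common way to reduce the variance that bucket collisions introduce into any single estimate.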

13. A data processing apparatus, comprising:

a hash operation module, configured to perform a hash operation on each piece of data to be processed in a target scene using at least one hash algorithm, to obtain at least one hash value corresponding to each piece of data to be processed;

a data superposition module, configured to superimpose, according to the at least one hash value corresponding to the data to be processed, the data to be processed onto the data at the position in a preset data list that corresponds to the hash algorithm and the hash value, to obtain superimposed data corresponding to each hash algorithm;

a data calculation module, configured to sort the superimposed data corresponding to each hash algorithm in the preset data list, acquire, according to the sorting result, target superimposed data that corresponds to each hash algorithm and matches a set index, and generate non-target data corresponding to each hash algorithm from the superimposed data, other than the target superimposed data, among the superimposed data corresponding to that hash algorithm;

and a result generation module, configured to calculate, according to the target superimposed data and the non-target data, the data matching the set index among the plurality of pieces of data to be processed, so as to obtain a data processing result corresponding to the target scene.

14. A computer-readable medium storing a computer program which, when executed by a processor, implements the data processing method of any one of claims 1 to 12.

15. An electronic device, comprising:

a processor; and

a memory for storing executable instructions of the processor;

wherein execution of the executable instructions by the processor causes the electronic device to perform the data processing method of any one of claims 1 to 12.