Serial number	User ID	User gender	Transaction ID	Commodity ID	Transaction amount
1	123	Male	1001	001	5
2	124	Male	1002	003	15
3	125	Female	1003	005	15
4	126	Male	1004	007	10
5	127	Female	1005	009	5
6	128	Female	1006	011	5
7	129	Female	1007	013	5
8	130	Female	1008	015	10
9	131	Male	1009	017	5
10	133	Female	1010	019	5

Table 1 is only an example of a database table, and the database table is not limited to the contents of Table 1. It includes the data items serial number, user ID, user gender, transaction ID, commodity ID, and transaction amount. For serial numbers 1 to 10, the row corresponding to each serial number is one record item, and each record item is a transaction record of a user. Each record item includes the aforementioned data items, and each data item stores data relating to the user's transaction. The data may be numerical values.

At step S120, the duplicate data in the one or more data is counted to determine a count value for each duplicate data. Further, a count value table can be generated.

The counting of the duplicated data is performed during the full-table scanning process of the database table, i.e. while scanning one or more data stored in the database table (step S110), the duplicated data is counted and the count values are recorded.

Duplicate data refers to identical data, that is, data that appears repeatedly with the same value (numerical value). The count value is the number of times a duplicate data occurs among the one or more data. Across the one or more record items, identical data stored in the same data item are counted to obtain the count value of that data. In other words, the data recorded in a database table may be duplicated: data having the same value may occur one or more times in the database table.

Further, during the full-table scan, the number of times each numerical value of a given data item appears can be obtained by counting, and a corresponding count value table is generated. As shown in Table 2, the count value table takes the form "value - count value" for that data item.

For example, for data item A:

Value 1, counted 10 times;

Value 2, counted 5 times;

...

Thus, by performing the counting and/or aggregation during the scan described in step S120, the database table can be compressed. When a database table holds a large amount of data, record items containing numerically identical data may be merged. For example, if the scores of 100,000 examinees in a national examination are recorded in a database table, the table may be compressed by score (the data item) to obtain a count value table. If the scores are on a 100-point scale, the count value table contains only one count value for each score from 0 to 100, i.e., about 100 records. The database table of 100,000 record items is thus compressed into a count value table of about 100 record items, a compression ratio of about 1000:1, which effectively reduces the data volume. The reduced data volume lightens the computational burden and processing pressure on the computer in subsequent operations. Moreover, the counting and the generation of the count value table are completed during the full-table scan and generate no additional IO (additional IO is 0). The full table is scanned only once (step S110, IO is 1).
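As a rough illustration of the compression described above, the following sketch builds a count value table from 100,000 synthetic scores. The scores are randomly generated purely for illustration (they are not data from this application), and the names `scores` and `count_table` are hypothetical.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical data: 100,000 integer scores on a 0-100 scale.
scores = [random.randint(0, 100) for _ in range(100_000)]

# One pass over the data yields the count value table (score -> count value).
count_table = Counter(scores)

# The table cannot hold more rows than there are distinct scores.
print(len(scores), "records compressed into", len(count_table), "rows")
print("compression ratio is roughly", len(scores) // len(count_table), ": 1")
```

Integer scores from 0 to 100 inclusive have at most 101 distinct values, so the ratio obtained this way is close to, but not exactly, the 1000:1 figure quoted above.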

In table 1, the data item "transaction amount" is used as a reference, and the transaction amount 15 appears twice during scanning. The number of repeated occurrences of each identical data may be counted during the scanning of one or more data recorded in the database table to obtain the number of repeated occurrences (count value) of each data, and thus obtain the count value table.

Table 2 shows a count table of the count values of the transaction amounts in table 1.

Table 2:

Transaction amount	Count value
5	6
10	2
15	2

That is, across all record items of Table 1, the data item "transaction amount" holds the value 5 six times, the value 10 twice, and the value 15 twice. Since every record item has a transaction amount, Table 1 contains 6 record items with a transaction amount of 5, 2 record items with a transaction amount of 10, and 2 record items with a transaction amount of 15.
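The counting performed during the scan can be sketched as follows. This is a minimal illustration, assuming the rows of Table 1 are held in memory as tuples; only the columns needed here are kept, and `build_count_table` is a hypothetical helper name rather than part of the described method.

```python
from collections import Counter

# Rows of Table 1, reduced to (serial number, user ID, transaction ID, amount).
TABLE_1 = [
    (1, 123, 1001, 5),
    (2, 124, 1002, 15),
    (3, 125, 1003, 15),
    (4, 126, 1004, 10),
    (5, 127, 1005, 5),
    (6, 128, 1006, 5),
    (7, 129, 1007, 5),
    (8, 130, 1008, 10),
    (9, 131, 1009, 5),
    (10, 133, 1010, 5),
]

def build_count_table(rows, column):
    """Scan the rows once and count how often each value of `column` occurs."""
    counts = Counter()
    for row in rows:                 # the single full-table scan (step S110)
        counts[row[column]] += 1     # counting happens during the scan (step S120)
    return dict(sorted(counts.items()))

count_table = build_count_table(TABLE_1, column=3)
print(count_table)  # {5: 6, 10: 2, 15: 2}
```

The result matches Table 2: no second pass over the rows is needed to produce it.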

At step S130, a result related to analyzing the data distribution is calculated from each of the duplicated data and the count value of each of the duplicated data.

The duplicate data may be further processed based on the count value of each duplicate data, so as to obtain a series of results (parameters), and the distribution of one or more data recorded in the database table may be reflected according to the series of results (parameters). For example, the series of results (parameters) may reflect the peak of the data distribution in one or more data sets.

The results (parameters) associated with analyzing the data distribution may include the mean, maximum, minimum, quantile, mode, and the like. The mean is the average of the one or more data. The maximum is the largest data among the one or more data. The minimum is the smallest data among the one or more data. The mode is the data repeated the most times among the one or more data. A quantile α is a value in the interval (0, 1). A quantile α corresponds to one of the one or more data, such that the data making up a proportion α of the one or more data are not larger than the data corresponding to the quantile α.

The duplicate data recorded in the count value table and their count values may be processed by scanning the count value table generated in step S120, so as to calculate the mean, maximum, minimum, and mode of the one or more data recorded in the database table. During the same scan of the count value table, the total number of record items in the full table (i.e., the number of records) and the total value (the sum of the data in a given data item) can also be calculated. The IO of this scan is greatly reduced: in the examinee score example of step S120, scanning the count value table costs only about 0.001 of the IO of a full-table scan.

While scanning each duplicate data and its count value, the computer system may calculate the sum of the one or more data recorded in a data item of the full table and the total count value of the count values of the one or more duplicate data, thereby obtaining the mean of the one or more data; compare the sizes of the duplicate data to obtain the maximum and minimum of the one or more data; and compare the count values of the duplicate data to obtain the mode of the one or more data.

Specifically, by scanning the count value table, the total number of record items in the database table, i.e., the total record number, can be determined, and the sum of the one or more recorded data can be calculated, so that the mean of the data recorded in the database table is obtained: mean = sum of data / total record number. Here, the total record number is the sum of the count values of the duplicate data, i.e., the total count value, which represents the total number of record items; the sum of the data is obtained by multiplying each duplicate data by its count value and summing the products. The maximum and minimum can be obtained by pairwise comparison of the duplicate data in the count value table, and they are also the maximum and minimum of the one or more data in the database table. The duplicate data with the largest count value, found from the count values in the count value table, is the mode of the one or more data in the database table.

Further, using the calculated mean, the duplicate data and their count values recorded in the count value table can be scanned again to obtain the variance and standard deviation of the data recorded in the database table. Because this scan again reads only the compressed count value table, its IO is extremely small, e.g., 0.001 IO in the examinee score example above.
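The second scan for variance and standard deviation might look like the sketch below, which uses the count value table of Table 2. The text does not say whether the population or sample variance is intended; the population form (dividing by the total count value) is assumed here.

```python
import math

# Count value table from Table 2: transaction amount -> count value.
count_table = {5: 6, 10: 2, 15: 2}

# First scan of the count value table: total count value and mean.
total_count = sum(count_table.values())
mean = sum(value * count for value, count in count_table.items()) / total_count

# Second scan: count-weighted squared deviations from the mean.
squared_dev = sum(count * (value - mean) ** 2
                  for value, count in count_table.items())
variance = squared_dev / total_count   # population variance (an assumption)
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)  # 8.0 16.0 4.0
```

Both scans touch only the three rows of the count value table, never the ten rows of the original table.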

Taking the count value table shown in Table 2 as an example, scanning the count value table shows that the total number of record items in the database table is 10, i.e., the sum of the count values (6 + 2 + 2 = 10), and that the total transaction amount is 80, i.e., the sum of the products of each transaction amount and its count value (5 × 6 + 10 × 2 + 15 × 2 = 80). The mean transaction amount in Table 1 is therefore 8 (80 / 10 = 8). Comparing the transaction amounts 5, 10, and 15 pairwise gives the maximum 15 and the minimum 5. Because the mode is the data with the largest count value, and the largest count value in Table 2 is 6, which belongs to transaction amount 5, the mode is 5.
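The single scan of the count value table described above can be sketched as follows. `summarize` is a hypothetical helper name, and the tie-breaking rule for the mode (keep the first value encountered) is an assumption, since the text does not address ties.

```python
def summarize(count_table):
    """Derive summary statistics from a {value: count value} table in one scan."""
    if not count_table:
        raise ValueError("count value table is empty")
    total_count = 0   # total number of record items
    total_sum = 0     # sum of all data in the data item
    minimum = maximum = mode = None
    mode_count = 0
    for value, count in count_table.items():
        total_count += count
        total_sum += value * count
        if minimum is None or value < minimum:
            minimum = value
        if maximum is None or value > maximum:
            maximum = value
        if count > mode_count:   # ties keep the first value seen (assumed)
            mode, mode_count = value, count
    return {
        "count": total_count,
        "sum": total_sum,
        "mean": total_sum / total_count,
        "min": minimum,
        "max": maximum,
        "mode": mode,
    }

stats = summarize({5: 6, 10: 2, 15: 2})
print(stats)
# {'count': 10, 'sum': 80, 'mean': 8.0, 'min': 5, 'max': 15, 'mode': 5}
```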

The process of scanning the data recorded in the count value table and the count value corresponding to the data, calculating the quantile to generate a quantile distribution table, and further determining the data or the repeated data in a certain data item corresponding to any given quantile is shown in fig. 2. FIG. 2 is a flow chart of the steps of calculating quantiles according to an embodiment of the present application.

At step S210, when scanning each duplicate data and the count value of each duplicate data, the one or more duplicate data and the count value of the one or more duplicate data may also be sorted according to the size of each duplicate data. For example, in the process of scanning the count value table to calculate the number of entries in step S130, the repeated data in the count value table may be sorted.

One or more repeated data and a count value corresponding to each repeated data are contained in the count value table. The one or more duplicate data is ordered according to the size of the one or more duplicate data. The one or more data are sorted, for example, in an ascending manner.

As shown in table 2, the transaction amount 5 is smaller than the transaction amount 10, the transaction amount 10 is smaller than the transaction amount 15, and the transaction amounts and the count values of the transaction amounts are sorted according to the transaction amounts.

Here, the count value indicates how many record items hold a given value of the transaction amount; taking transactions as an example, it indicates how many transactions have that amount. That is, 6 transactions have an amount of 5, 2 transactions have an amount of 10, and 2 transactions have an amount of 15.

At step S220, based on the sorted order, an accumulated value of each duplicate data is obtained from the count value of that duplicate data and the count values of all duplicate data arranged before it, thereby obtaining an accumulated value table. The accumulated value is the sum of the count value of the duplicate data and the count values of all duplicate data arranged before it.

For example, Table 3 is an accumulated value table obtained from the transaction amounts and count values recorded in Table 2. It includes the transaction amounts and count values of Table 2, together with the accumulated value of each transaction amount derived from those count values. The transaction amounts in ascending order are 5, 10, and 15, and their count values are 6, 2, and 2, respectively.

Table 3:

Transaction amount	Count value	Accumulated value	Total count value
5	6	6	10
10	2	8	10
15	2	10	10

Further, Table 3 may also include the total count value obtained from the count values recorded in Table 2 (6 + 2 + 2 = 10).
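Steps S210 and S220 can be sketched together as a sort followed by a running sum. This is an illustrative sketch only; `build_cumulative_table` is a hypothetical name, and the rows are returned as plain tuples in the column order of Table 3.

```python
from itertools import accumulate

def build_cumulative_table(count_table):
    """Return rows of (value, count value, accumulated value, total count value)."""
    values = sorted(count_table)                  # ascending sort (step S210)
    counts = [count_table[v] for v in values]
    cumulative = list(accumulate(counts))         # running sum (step S220)
    total = cumulative[-1] if cumulative else 0   # total count value
    return [(v, c, acc, total)
            for v, c, acc in zip(values, counts, cumulative)]

cumulative_table = build_cumulative_table({5: 6, 10: 2, 15: 2})
for row in cumulative_table:
    print(row)
# (5, 6, 6, 10)
# (10, 2, 8, 10)
# (15, 2, 10, 10)
```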

At step S230, a quantile interval in which each duplicate data is located is obtained according to the accumulated value and the count value of each duplicate data.

The quantile interval in which each duplicate data is located is the interval, bounded by two quantiles, within which the quantiles of that duplicate data (one or more data of the same value) fall. The quantile interval includes a starting quantile and an ending quantile: the starting quantile is the left endpoint of the interval, and the ending quantile is the right endpoint.

Specifically, by scanning the data recorded in the accumulated value table, as shown in Table 3, the quantile distribution of the one or more data recorded in the database table can be analyzed. The starting quantile and ending quantile of each duplicate data can be calculated from its count value, its accumulated value, and the total count value, using the following formulas. The IO of scanning the accumulated value table is also minimal.

Starting quantile = (accumulated value - count value) / total count value.

Ending quantile = accumulated value / total count value.

Further, from the quantile interval in which each duplicate data is located, the data corresponding to any quantile can be determined, where the quantile may be any value between 0 and 1. Specifically, the quantile interval in which the given quantile lies is determined, and the duplicate data corresponding to that quantile is then determined from the correspondence between quantile intervals and duplicate data.

Through the method, the repeated data corresponding to any quantile can be determined, so that subsequent data mining can be performed on one or more data in the database table.

Taking Table 3 as an example: the starting quantile of transaction amount 5 is (6 - 6) / 10 = 0 and its ending quantile is 6 / 10 = 0.6, so the quantile interval in which transaction amount 5 is located is (0, 0.6]. The starting quantile of transaction amount 10 is (8 - 2) / 10 = 0.6 and its ending quantile is 8 / 10 = 0.8, so its quantile interval is (0.6, 0.8]. The starting quantile of transaction amount 15 is (10 - 2) / 10 = 0.8 and its ending quantile is 10 / 10 = 1, so its quantile interval is (0.8, 1].

From this, a result table, i.e., a quantile distribution table, can be obtained as shown in Table 4. Therefore, the repeated data corresponding to any quantile can be determined, so that the speed and the efficiency of processing such as mining, searching and the like of one or more data can be further improved.

Table 4:

Transaction amount	Starting quantile	Ending quantile
5	0	0.6
10	0.6	0.8
15	0.8	1
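The quantile distribution table can be derived from a count value table by applying the starting and ending quantile formulas of step S230 row by row. The sketch below is one possible way to do so; `build_quantile_table` is a hypothetical name, and a non-empty count value table is assumed.

```python
def build_quantile_table(count_table):
    """Return rows of (value, starting quantile, ending quantile)."""
    total = sum(count_table.values())        # total count value
    rows = []
    cumulative = 0
    for value in sorted(count_table):        # ascending order (step S210)
        count = count_table[value]
        cumulative += count                  # accumulated value (step S220)
        start = (cumulative - count) / total # starting quantile (step S230)
        end = cumulative / total             # ending quantile (step S230)
        rows.append((value, start, end))
    return rows

quantile_table = build_quantile_table({5: 6, 10: 2, 15: 2})
for row in quantile_table:
    print(row)
# (5, 0.0, 0.6)
# (10, 0.6, 0.8)
# (15, 0.8, 1.0)
```

Each row's ending quantile equals the next row's starting quantile, so the intervals tile the range from 0 to 1 without gaps.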

By scanning the quantile distribution table, such as Table 4, the duplicate data of a given data item corresponding to any given quantile α can be determined from the interval in which α falls. The IO of this scan is still minimal. For example, the quantile interval of transaction amount 5 is (0, 0.6], that of transaction amount 10 is (0.6, 0.8], and that of transaction amount 15 is (0.8, 1]. When the duplicate data corresponding to the quantile α = 0.75 is required, Table 4 shows that 0.75 lies in the quantile interval (0.6, 0.8], whose duplicate data is 10. The duplicate data corresponding to the quantile 0.75 is therefore 10, i.e., the transaction amount at that quantile is 10.
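Looking up the data for a given quantile can be sketched as a search over the ending quantiles. The sketch assumes each interval is half-open on the left and closed on the right, following the intervals computed from Table 3; `value_at_quantile` is a hypothetical name, and binary search is used here only as a convenience where a linear scan of the table would serve equally well.

```python
from bisect import bisect_left

# Quantile distribution table (Table 4): (value, starting quantile, ending quantile).
quantile_table = [(5, 0.0, 0.6), (10, 0.6, 0.8), (15, 0.8, 1.0)]

def value_at_quantile(quantile_table, alpha):
    """Return the value whose interval (start, end] contains alpha."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    ends = [end for _, _, end in quantile_table]
    # First row whose ending quantile is >= alpha, i.e. start < alpha <= end.
    index = bisect_left(ends, alpha)
    return quantile_table[index][0]

print(value_at_quantile(quantile_table, 0.75))  # 10
print(value_at_quantile(quantile_table, 0.6))   # 5
```

A quantile that falls exactly on a boundary, such as 0.6, is assigned to the interval it closes, so it maps to 5 rather than 10.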

As can be seen from step S130, in the present application, after the full table is scanned once in step S110 (IO is 1), only the count value table, the accumulated value table, and the quantile distribution table need to be scanned. Because the information in these tables is highly compressed relative to the database table, the IO of each such scan is far smaller than 1. The operation is thereby completed, and the mean, maximum, minimum, mode, quantiles, and the duplicate data corresponding to each quantile can be obtained.

The results obtained by the data processing method, such as the mean, maximum, minimum, mode, and quantiles, can be used to analyze the distribution of the data and to support further data mining. For example, in large-scale electronic commerce, where massive amounts of commodity transaction information are stored in a database, users' purchasing preferences can be analyzed from the distribution of transaction amounts in the transaction details (database table), so that the user experience can be improved according to those preferences.

According to an aspect of the present application, a data processing apparatus is also provided. FIG. 3 is a block diagram of a data processing apparatus 300 according to an embodiment of the present application.

In the apparatus 300, the scanning module 310 may be used to scan one or more data stored in a database table. The database table records one or more record items, where one record item represents one record and each row of the database table holds one record item; each record item comprises one or more data items, and each column of the database table holds a different data item; each data item stores the data corresponding to that data item in each record item; and the data are numerical values. For the specific implementation of the scanning module 310, see step S110.

The counting module 320 may be configured to count duplicate data in the one or more data based on the scan to determine a count value for each duplicate data. The counting module 320 may also be configured to count identical data stored in one of the one or more data items of the database table. For the specific implementation of the counting module 320, see step S120.

The calculation module 330 may be configured to calculate a result associated with analyzing the data distribution based on each duplicate data and the count value of each duplicate data. The results associated with analyzing the data distribution include at least the mean, maximum, minimum, quantile, and mode of the one or more data stored in one of the one or more data items of the database table. The calculation module 330 is configured to: while scanning each duplicate data and its count value, calculate the sum of the one or more data and the total count value of the count values of the one or more duplicate data to obtain the mean of the one or more data; compare the sizes of the duplicate data to obtain the maximum and minimum of the one or more data; and compare the count values of the duplicate data to obtain the mode of the one or more data. For the specific implementation of the calculation module 330, see step S130.

The calculation module 330 may further include: a sorting sub-module 331, an accumulation sub-module 332, and an obtaining sub-module 333.

The sorting submodule 331 may be configured to sort the one or more duplicate data and the count value of the one or more duplicate data according to a size of each duplicate data when scanning each duplicate data and the count value of each duplicate data. The specific implementation of the sorting sub-module 331 can be seen in step S210.

The accumulation submodule 332 may be configured to obtain an accumulated value of each duplicate data according to the count value of each duplicate data and the count values of all the duplicate data arranged in front of each duplicate data based on the order formed by the sorting; wherein the accumulated value is a sum of the count value of each of the duplicated data and the count values of all of the duplicated data arranged in front of each of the duplicated data. The specific implementation of the accumulation sub-module can be seen in step S220.

The obtaining sub-module 333 may be configured to obtain a quantile interval where each piece of repeated data is located according to the accumulated value and the count value of each piece of repeated data. A specific implementation of the obtaining sub-module may be seen in step S230.

The determining module 340 may be configured to determine the repeated data corresponding to any quantile according to a quantile interval in which the any quantile is located, where any quantile may be any value between 0 and 1.
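How the modules of apparatus 300 might be wired together is sketched below, with each module modeled as a plain function. This is only one possible arrangement under that assumption; the function names are hypothetical, the input rows are (serial number, transaction amount) pairs taken from Table 1, and a non-empty table is assumed.

```python
from collections import Counter
from itertools import accumulate

def scan(rows, column):                       # scanning module 310
    """Yield the data of one data item, one record item at a time."""
    for row in rows:
        yield row[column]

def count(values):                            # counting module 320
    """Count duplicate data while consuming the scan."""
    return Counter(values)

def quantile_intervals(count_table):          # calculation module 330
    values = sorted(count_table)              # sorting submodule 331
    counts = [count_table[v] for v in values]
    cumulative = list(accumulate(counts))     # accumulation submodule 332
    total = cumulative[-1]
    return [(v, (acc - c) / total, acc / total)   # obtaining submodule 333
            for v, c, acc in zip(values, counts, cumulative)]

def determine(intervals, alpha):              # determining module 340
    """Return the duplicate data whose interval (start, end] contains alpha."""
    for value, start, end in intervals:
        if start < alpha <= end:
            return value
    raise ValueError("alpha must lie in (0, 1]")

rows = [(1, 5), (2, 15), (3, 15), (4, 10), (5, 5),
        (6, 5), (7, 5), (8, 10), (9, 5), (10, 5)]
intervals = quantile_intervals(count(scan(rows, column=1)))
print(determine(intervals, 0.75))  # 10
```

Because `scan` is a generator, the counting module consumes the rows as they are read, which mirrors counting during the full-table scan rather than after it.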

Since the specific embodiments of the modules included in the apparatus described in FIG. 3 correspond to the specific embodiments of the steps in the method of the present application, and FIGS. 1-2 have already been described in detail, the details of the modules are not repeated here so as not to obscure the present application.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.