CN104657388A - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN104657388A
Authority
CN
China
Prior art keywords
data
count value
value
duplicate
repeated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310597967.8A
Other languages
Chinese (zh)
Inventor
吕春建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201310597967.8A
Publication of CN104657388A

Abstract

The application relates to a data processing method and a data processing device. The data processing method includes: scanning one or more pieces of data stored in a database table; counting, based on the scanning, the repeated data among the one or more pieces of data and determining a count value for each piece of repeated data; and calculating a result related to analyzing the data distribution according to each piece of repeated data and its count value. Because the method and device operate on the repeated data and their count values, they avoid repeated scanning of all the data, effectively reduce computer IO, significantly reduce the amount of data to be computed, improve computation efficiency, shorten data processing time, and reduce the running burden on the computer. By using quantile intervals, they also obtain quantiles rapidly and reduce computation cost. Accordingly, the data processing method and device improve computation performance and computation efficiency during computer data processing.

Description

Data processing method and device
Technical Field
The present application relates to the field of computer processing, and more particularly, to a data processing method and apparatus.
Background
In the current big data era, data collection and processing technology is developing rapidly. The quantity of data collected by computers is huge, and a database table used to store the data may run into the billions of record items, where each record item may store one or more data. The data may be numerical values. To analyze and mine the value of these data (values), the first step is usually to have a computer operate on or process them in order to explore how the data are distributed.
In a conventional process for probing the data distribution, the database table must be scanned multiple times to calculate the various results related to the data it contains, for example: mean, maximum, minimum, variance, standard deviation, quantile, mode, and the like. The distribution of the data can then be analyzed through these results, and subsequent data mining and the like can be carried out.
Specifically, calculating these results requires scanning the data in the database table at least 5 times.
First, calculating the mean, the maximum value and the minimum value of the data requires 1 full-table scan.
Second, calculating the variance and the standard deviation of the data requires 1 further full-table scan.
Third, before quantiles can be calculated, the data must be sorted, which requires at least 1 full-table scan. A portion of the data in the database table is first sampled, which scans between 0 and 1 times the table (e.g., 0.1 times). One or more contiguous data intervals are then divided according to the sampled data, each data interval representing a numerical range, such as: [0,5], [6,7], [8,9]. Based on this division, a map-reduce method can be used to sort the data in the database table: the map module scans all the data in the database table once to determine the data interval in which each data falls, and each of the one or more reduce modules sorts the data in one data interval. Since the data intervals are contiguous, the sorted results of the individual intervals can finally be merged to obtain the sorted result for all the data.
Fourth, calculating the quantiles requires scanning the sorted data 1 or more times to obtain one or more quantiles.
Fifth, calculating the mode requires at least 1 full-table scan: the table is scanned once to group data with the same value, and the grouped data are then scanned 0 to 1 times (e.g., 0.1 times) to find the data with the most repetitions (the mode).
It can be seen that, in conventional data processing, the database table is scanned 5 or more times to obtain the required results, and the amount of computation is huge.
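The interval-based map-reduce sort described for the conventional approach can be illustrated with a minimal single-process sketch. The function name `range_partition_sort`, the `upper_bounds` parameter, and the sample values are hypothetical; only the intervals [0,5], [6,7], [8,9] come from the text, and the sketch assumes every value falls inside one of the intervals.

```python
from bisect import bisect_left

def range_partition_sort(values, upper_bounds):
    """Sort values by routing them into contiguous intervals, then sorting each interval."""
    # "Map" phase: one full pass assigns every value to its interval.
    buckets = [[] for _ in upper_bounds]
    for v in values:
        idx = bisect_left(upper_bounds, v)  # first interval whose upper bound is >= v
        buckets[idx].append(v)
    # "Reduce" phase: each interval is sorted independently.
    sorted_buckets = [sorted(b) for b in buckets]
    # The intervals are contiguous and ordered, so concatenation gives the global order.
    merged = [v for b in sorted_buckets for v in b]
    return merged, [len(b) for b in buckets]

# Hypothetical sample values; upper bounds 5, 7, 9 encode [0,5], [6,7], [8,9].
values = [3, 9, 0, 6, 5, 5, 7, 1, 5, 8]
merged, sizes = range_partition_sort(values, upper_bounds=[5, 7, 9])
print(merged)  # [0, 1, 3, 5, 5, 5, 6, 7, 8, 9]
print(sizes)   # [6, 2, 2]
```

The `sizes` list shows how much work each reduce step receives. When values concentrate in one interval, that interval's bucket dominates.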
In addition, as computer and network technologies are applied across industries, the volume of data stored in database tables keeps growing. Existing data processing techniques scan all the data in the database table many times during operation, so the IO of the computer is extremely high. Here, IO refers to the amount of data a computer reads from a peripheral device (disk, network, tape, etc.) into memory. Further, sorting the data in a database table involves a large number of entries and a large data volume, so the full-table sort required for calculating quantiles reduces the operating efficiency of the computer. Furthermore, if the values in the database table are concentrated in one data interval, that is, the data are severely skewed, the reduce module responsible for that interval has too much data to sort, which lengthens the data processing time and may even prevent a calculation result from being obtained. Finally, only one quantile can be calculated at a time; if several quantiles are required, the sorted data must be scanned and traversed multiple times, so the computing overhead is large.
Disclosure of Invention
To overcome the high computer IO and low computing efficiency that arise when a large amount of data is processed, the main purpose of the present application is to provide a data processing method and device that improve operational performance and computing efficiency during computer data processing.
The scheme provided by the application further addresses the low operating efficiency, long data processing time, or failure to obtain a calculation result caused by an excessive amount of computation and over-concentrated values, thereby improving operating efficiency and shortening data processing time. It also addresses the high data reading overhead caused by a large data volume, thereby saving data reading overhead.
In order to solve the technical problem, the purpose of the present application is achieved by the following technical solutions:
a method of data processing, comprising: scanning one or more data stored in a database table; counting duplicate data in the one or more data based on the scanning, determining a count value for each of the duplicate data; and calculating a result related to analyzing data distribution according to the repeated data and the counting value of the repeated data.
The database table records one or more record items. One record item represents one record, and each row of the database table holds one record item. Each record item comprises one or more data items, and each column of the database table holds a different data item. Each data item stores the data corresponding to that data item in each record item, wherein the data is a numerical value. Counting duplicate data in the one or more data based on the scanning includes: counting the identical data stored by one of the one or more data items of the database table.
Wherein the results related to analyzing the data distribution at least comprise: the mean, maximum, minimum, quantile, and mode of the one or more data stored by one of the one or more data items of the database table.
Wherein calculating a result related to analyzing data distribution according to each duplicate data and the count value of each duplicate data comprises: while scanning each duplicate data and its count value, calculating the sum of the one or more data and the total count value of the count values of the duplicate data to obtain the mean of the one or more data; comparing the sizes of the duplicate data to obtain the maximum value and the minimum value of the one or more data; and comparing the sizes of the count values of the duplicate data to obtain the mode of the one or more data.
Wherein calculating a result related to analyzing data distribution according to each duplicate data and the count value of each duplicate data comprises: while scanning each duplicate data and its count value, sorting the one or more duplicate data and their count values according to the size of each duplicate data; based on the sorted order, obtaining an accumulated value for each duplicate data, wherein the accumulated value is the sum of the count value of that duplicate data and the count values of all duplicate data arranged before it; and obtaining the quantile interval in which each duplicate data is located according to the accumulated value and the count value of each duplicate data.
The duplicate data corresponding to any quantile is determined according to the quantile interval in which that quantile is located, wherein the quantile can be any value between 0 and 1.
A data processing apparatus comprising: a scanning module configured to scan one or more data stored in a database table; a counting module configured to count duplicate data of the one or more data based on the scanning, and determine a count value of each of the duplicate data; and a calculation module configured to calculate a result related to analyzing data distribution according to each duplicate data and the count value of each duplicate data.
The database table records one or more record items. One record item represents one record, and each row of the database table holds one record item. Each record item comprises one or more data items, and each column of the database table holds a different data item. Each data item stores the data corresponding to that data item in each record item, wherein the data is a numerical value. The counting module is further configured to count the identical data stored by one of the one or more data items of the database table.
Wherein the results related to analyzing the data distribution at least comprise: the mean, maximum, minimum, quantile, and mode of the one or more data stored by one of the one or more data items of the database table.
Wherein the calculation module is configured to: while scanning each duplicate data and its count value, calculate the sum of the one or more data and the total count value of the count values of the duplicate data to obtain the mean of the one or more data; compare the sizes of the duplicate data to obtain the maximum value and the minimum value of the one or more data; and compare the sizes of the count values of the duplicate data to obtain the mode of the one or more data.
Compared with the prior art, the technical scheme according to the application has the following beneficial effects:
according to the method and device of the present application, the data distribution analysis is performed on the count values of the repeated data in the database table, so that repeated scanning of all data in the database table is avoided and computer IO is effectively reduced. Furthermore, because the results related to analyzing the data distribution are calculated from each repeated data and its count value, the amount of data involved in the operation is reduced, operating efficiency is improved, data processing time is shortened, and the running burden on the computer is lightened. Furthermore, the quantile interval in which each data is located can be obtained while calculating these results, so that quantiles can be obtained rapidly and the calculation cost is reduced. Therefore, the data processing performance of the computer is effectively improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow diagram of a data processing method according to an embodiment of the present application;
FIG. 2 is a flow chart of the calculation steps of quantiles according to an embodiment of the present application; and
FIG. 3 is a block diagram of a data processing apparatus according to an embodiment of the present application.
Detailed Description
The main idea of the present application is that, in the process of performing data processing on massive data by a computer system, the number of times of repetition of the same data in a database table is counted to obtain a count value table, and then a result related to data distribution analysis (exploration) is obtained based on the count value table. Further, a total count value, a mean value, a maximum value, a minimum value, a mode of one or more data in the database table and an accumulated value of each data may be calculated according to the count value of each data in the count value table, and a quantile interval in which each data is located may be obtained according to the total count value and the accumulated value of each data. And further, the data distribution condition of mass data can be analyzed so as to carry out subsequent data mining and the like. Specifically, reference may be made to the following description of the data processing method of the present application.
In operation, the application first obtains the count value table by counting, and then obtains the results related to analyzing the data distribution from the count value table. This effectively compresses the information in the database table, so that computer IO is reduced and calculation efficiency is improved when the results are calculated. It also avoids the low processing efficiency caused by the heavy computation of sorting, which reduces the amount of computation and lightens the processing burden of the computer system. Furthermore, processing time is reduced and the overhead of traversal is saved, so that the performance of the computer is improved.
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
FIG. 1 is a flowchart of a data processing method according to an embodiment of the present application.
At step S110, one or more data stored in a database table is scanned.
The data may be various data stored in the storage device. For a large amount of data with certain relationships, database tables can be used to store the data in an organized manner.
In a relational database, a database table may be a collection of a series of two-dimensional arrays used to represent and store relationships between data objects. It consists of vertical columns and horizontal rows. For example, in a database table recording student achievement information, each student's achievement information can be a record item (row); the student's name, class, subject, score and the like are data items (columns); and the corresponding data (numerical value) is recorded at each row-column intersection.
Specifically, the computer may record the associated one or more data into a database table in advance (e.g., Table 1). In one embodiment, the database table (Table 1) may include one or more record items, where a record item is one record (row) in the database table; the "serial number" column of Table 1 shows ten records, in rows 1 to 10. Each record item may include one or more data items, such as the columns "serial number", "user ID", "user gender", etc. of Table 1; that is, a data item is an item indicating a particular type of information, and in the example of Table 1 each record item includes data items such as a serial number and a user ID. Each data item may store data; for example, the "user ID" data item in Table 1 stores the data "123", the "serial number" data item stores the data "1", and so on. The data stored in a data item may be a numerical value. Note that, for data to be operated on in a computer system, character data may also be represented by a numerical (binary) value; for example, the data "male" in "user gender" in Table 1 may be stored in numerical form as "00". Thus, one or more data (values) are recorded in the database table in an organized manner.
The computer may read each record item in the database table, and thereby read the data stored in each data item of each record item. Further, the database table may be stored in a peripheral device, and the computer may process the data by reading it from the peripheral device into memory. In one embodiment, before a computer system can process (e.g., operate on) the data, it must first obtain the data stored in the database table. The database table is accessed to read out its data, which may take the form of a database table scan, such as a full-table scan. Scanning refers to the process of examining each record item in a database table until the data in all record items that meet a given condition has been returned. A scan can search every record of a database table that has no index, so as to obtain all qualifying data.
Scanning the whole database table once to obtain the data corresponds to an IO of 1.
The following description takes as an example an application, such as electronic commerce, that has a large data volume and requires arithmetic processing in a computer system. The huge volume of e-commerce data related to user transactions is stored in a database in an organized manner, and may be recorded in a database table. As shown in Table 1, each piece of a user's transaction information is a record item. The user ID, user gender, transaction ID, commodity ID, transaction amount, and the like in the transaction information are the data items of the record item. The data stored in each data item is the data to be operated on and processed by the computer system. Taking the numerical values stored in data items such as the transaction amount or a score as an example, operating on these values makes it possible to analyze the data distribution, for instance to find the numerical range in which the transaction amounts or scores are concentrated, in preparation for further processing such as data mining and searching.
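Table 1:
Serial number | User ID | User gender | Transaction ID | Commodity ID | Transaction amount
1 | 123 | Male | 1001 | 001 | 5
2 | 124 | Male | 1002 | 003 | 15
3 | 125 | Female | 1003 | 005 | 15
4 | 126 | Male | 1004 | 007 | 10
5 | 127 | Female | 1005 | 009 | 5
6 | 128 | Female | 1006 | 011 | 5
7 | 129 | Female | 1007 | 013 | 5
8 | 130 | Female | 1008 | 015 | 10
9 | 131 | Male | 1009 | 017 | 5
10 | 133 | Female | 1010 | 019 | 5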
The database table of table 1 is only an example and is not limited to the contents of table 1. It includes several data items of serial number, user ID, user's sex, transaction ID, commodity ID and transaction amount. In serial numbers 1-10, a row of records corresponding to each serial number is a record item, and the record item is a transaction record of a user. Each entry includes the aforementioned several data items. And, data relating to the transaction of the user is stored in each data item. The data may be numerical values.
At step S120, the duplicate data in the one or more data is counted to determine a count value for each duplicate data. Further, a count value table can be generated.
The counting of the duplicated data is performed during the full-table scanning process of the database table, i.e. while scanning one or more data stored in the database table (step S110), the duplicated data is counted and the count values are recorded.
Duplicate data refers to identical data or data that appears repeatedly, the identical data or data that appears repeatedly having the same value (numerical value). The count value is the number of repetitions of repeated data in one or more data. In one or more entries, the same data stored in one or more data items are counted based on the same data item to obtain a count value of the same data. That is, there may be duplication of recorded data in a database table, i.e., data having the same value may be repeated one or more times in the database table.
Further, during the full-table scan, the number of times each identical value appears in a given data item may be counted, and a corresponding count value table generated. The count value table has the form "value - count value" for that data item (Table 2 below is an instance), for example:
Data item A
Value 1, count 10;
Value 2, count 5;
...
Thus, by performing the counting and/or aggregation during the scan as described in step S120, the database table can be compressed. When a database table holds a large amount of data, record items containing numerically identical data may be merged. For example, if a database table records the scores of 100,000 examinees in a national examination, the table may be compressed by score (the data item) to obtain a count value table. If the scores are on a hundred-point scale, the count value table only contains a count value for each score from 0 to 100 points, about 100 records in total. That is, the database table of 100,000 record items is compressed into a count value table of about 100 record items, a compression ratio of roughly 1000:1, so the data amount is effectively reduced. This reduces the data volume for subsequent operations and lightens the computational burden and processing pressure on the computer. Moreover, the counting and the generation of the count value table are completed during the full-table scan and generate no additional IO, i.e., the additional IO is 0. Overall, the table is scanned only once (step S110, IO is 1).
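The arithmetic behind the examinee example can be checked with a small sketch. The function name `compression_stats` is hypothetical, and the figure of 100 count-table records follows the text's round number rather than an exact count of distinct scores.

```python
def compression_stats(table_rows, count_table_rows):
    """Return (compression ratio, IO of scanning the count table relative to a full-table scan)."""
    ratio = table_rows / count_table_rows
    relative_io = count_table_rows / table_rows
    return ratio, relative_io

# 100,000 record items compressed into about 100 count-table records (figures from the text).
ratio, relative_io = compression_stats(100_000, 100)
print(ratio)        # 1000.0
print(relative_io)  # 0.001
```

Under these figures, each later pass over the count value table costs about one thousandth of a full-table scan.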
In table 1, the data item "transaction amount" is used as a reference, and the transaction amount 15 appears twice during scanning. The number of repeated occurrences of each identical data may be counted during the scanning of one or more data recorded in the database table to obtain the number of repeated occurrences (count value) of each data, and thus obtain the count value table.
Table 2 shows a count table of the count values of the transaction amounts in table 1.
Table 2:
Transaction amount | Count value
5 | 6
10 | 2
15 | 2
That is, across all record items of Table 1, the data item "transaction amount" holds the value 5 six times, the value 10 twice, and the value 15 twice. Since every record item has a transaction amount data item, this means that Table 1 contains 6 record items with a transaction amount of 5, 2 with a transaction amount of 10, and 2 with a transaction amount of 15.
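A minimal sketch of steps S110 and S120, assuming the "transaction amount" column of Table 1 has already been read into a Python list. The names `transaction_amounts` and `build_count_table` are illustrative and do not come from the application.

```python
from collections import Counter

# "Transaction amount" column of Table 1, in row order (serial numbers 1 to 10).
transaction_amounts = [5, 15, 15, 10, 5, 5, 5, 10, 5, 5]

def build_count_table(values):
    """Single pass over one data item: count how often each distinct value occurs."""
    counts = Counter()
    for v in values:      # stands in for the full-table scan of step S110
        counts[v] += 1    # counting happens during the same scan (step S120)
    return dict(counts)

count_table = build_count_table(transaction_amounts)
print(sorted(count_table.items()))  # [(5, 6), (10, 2), (15, 2)]
```

The resulting mapping of value to count value matches Table 2, and it is produced without a second pass over the column.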
At step S130, a result related to analyzing the data distribution is calculated from each of the duplicated data and the count value of each of the duplicated data.
The duplicate data may be further processed based on the count value of each duplicate data, so as to obtain a series of results (parameters), and the distribution of one or more data recorded in the database table may be reflected according to the series of results (parameters). For example, the series of results (parameters) may reflect the peak of the data distribution in one or more data sets.
The results (parameters) related to analyzing the data distribution may include the mean, maximum, minimum, quantile, mode, and the like. The mean is the average of the one or more data. The maximum value is the largest of the one or more data. The minimum value is the smallest of the one or more data. The mode is the data repeated the most times among the one or more data. A quantile α is a value in the interval (0, 1). The quantile α corresponds to one of the one or more data, such that a proportion α of the one or more data is smaller than the data corresponding to the quantile α.
By scanning the count value table generated in step S120, the repeated data recorded in the count value table and their corresponding count values may be processed to calculate the mean, maximum value, minimum value, and mode of the one or more data recorded in the database table. During the same scan, the total number of record items in the whole table (i.e., the number of records) and the total value (the sum of the data in a given data item) can also be calculated. The IO of this scan is greatly reduced: in the examinee score example of step S120, scanning the count value table costs an IO of only 0.001.
While scanning each duplicate data and its count value, the computer system may calculate the sum of the one or more data recorded by a data item in the full table and the total count value of the count values of the duplicate data, thereby obtaining the mean of the one or more data; compare the sizes of the duplicate data to obtain the maximum value and the minimum value of the one or more data; and compare the sizes of the count values of the duplicate data to obtain the mode of the one or more data.
Specifically, by scanning the count value table, the total number of record items in the database table, i.e., the total record number, can be determined, and the sum of the one or more recorded data can be calculated, so that the mean of the one or more data recorded in the database table is obtained: sum of data / total record number = mean. Here, the total record number is the sum of the count values of the respective repeated data, i.e., the total count value, which represents the total number of record items; the sum of the data is obtained by multiplying each repeated data by its count value and summing the products. The maximum value and the minimum value can be obtained by pairwise comparison of the repeated data in the count value table; they are the maximum and minimum of the one or more data in the database table. The repeated data with the largest count value in the count value table is the mode of the one or more data in the database table.
Further, using the mean obtained above, the repeated data and count values recorded in the count value table can be scanned again to obtain the variance and standard deviation of the data recorded in the database table. Because this again scans the compressed count value table, the IO is extremely small, e.g., 0.001 in the examinee score example above.
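The second scan for variance and standard deviation might look like the following sketch. The text does not say whether the population or the sample variance is meant, so the population form (dividing by the total count) is an assumption here, as is the function name `variance_from_counts`. The input is the count value table of Table 2.

```python
import math

def variance_from_counts(count_table):
    """Weighted (population) variance and standard deviation from value -> count pairs."""
    total = sum(count_table.values())
    # First scan of the count table: mean = sum(value * count) / total count.
    mean = sum(v * c for v, c in count_table.items()) / total
    # Second scan of the count table: each squared deviation is weighted by its count.
    variance = sum(c * (v - mean) ** 2 for v, c in count_table.items()) / total
    return variance, math.sqrt(variance)

table_2 = {5: 6, 10: 2, 15: 2}  # Table 2: transaction amount -> count value
variance, std_dev = variance_from_counts(table_2)
print(variance, std_dev)  # 16.0 4.0
```

Both scans touch only the three rows of the count value table, never the ten rows of the original table.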
Taking the count value table shown in Table 2 as an example, scanning the count value table shows that the total number of record items in the database table is 10, i.e., the sum of the count values (6 + 2 + 2 = 10), and that the total transaction amount is 80 (the sum of the transaction amount data), i.e., the sum of the products of each transaction amount and its count value (5 × 6 + 10 × 2 + 15 × 2 = 80). The mean transaction amount is therefore 8 (80 / 10 = 8). Comparing the transaction amounts 5, 10 and 15 pairwise gives the maximum value 15 and the minimum value 5. Because the mode is the data with the largest count value, and the largest count value in Table 2 is 6, the mode is the transaction amount 5.
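The worked example can be reproduced in a single pass over the count value table. This is a sketch only: the function name `summarize` and the tie-breaking rule for the mode (the first value seen wins) are assumptions, and the table is assumed to be non-empty.

```python
def summarize(count_table):
    """One pass over (value, count) pairs: total count, sum, mean, max, min, mode."""
    total_count = 0   # total record number = sum of the count values
    total_sum = 0     # sum of data = sum of value * count value
    maximum = minimum = mode = None
    mode_count = 0
    for value, count in count_table.items():
        total_count += count
        total_sum += value * count
        if maximum is None or value > maximum:
            maximum = value
        if minimum is None or value < minimum:
            minimum = value
        if count > mode_count:  # on ties, the first value seen is kept (assumption)
            mode, mode_count = value, count
    return {
        "total_count": total_count,
        "sum": total_sum,
        "mean": total_sum / total_count,
        "max": maximum,
        "min": minimum,
        "mode": mode,
    }

table_2 = {5: 6, 10: 2, 15: 2}  # Table 2: transaction amount -> count value
stats = summarize(table_2)
print(stats)
# {'total_count': 10, 'sum': 80, 'mean': 8.0, 'max': 15, 'min': 5, 'mode': 5}
```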
The process of scanning the data recorded in the count value table and the count value corresponding to the data, calculating the quantile to generate a quantile distribution table, and further determining the data or the repeated data in a certain data item corresponding to any given quantile is shown in fig. 2. FIG. 2 is a flow chart of the steps of calculating quantiles according to an embodiment of the present application.
At step S210, when scanning each duplicate data and the count value of each duplicate data, the one or more duplicate data and the count value of the one or more duplicate data may also be sorted according to the size of each duplicate data. For example, in the process of scanning the count value table to calculate the number of entries in step S130, the repeated data in the count value table may be sorted.
The count value table contains one or more repeated data and the count value corresponding to each repeated data. The repeated data are sorted by size, for example in ascending order.
As shown in table 2, the transaction amount 5 is smaller than the transaction amount 10, and the transaction amount 10 is smaller than the transaction amount 15, so the transaction amounts and their count values are sorted by transaction amount.
Here, the count value indicates how many entries hold a given value of the transaction amount; taking transactions as an example, it indicates how many transactions were for that amount. That is, 6 transactions have the amount 5, 2 transactions have the amount 10, and 2 transactions have the amount 15.
At step S220, based on the sorting order, a cumulative value of each duplicate data is obtained according to the count value of each duplicate data and the count values of all the duplicate data arranged in front of each duplicate data. Thereby obtaining an accumulated value table. Wherein the accumulated value is a sum of the count value of each of the duplicated data and the count values of all of the duplicated data arranged in front of each of the duplicated data.
For example, table 3 shows the accumulated value table obtained from the transaction amounts and count values recorded in table 2. It includes the transaction amounts and count values of table 2, together with the accumulated value of each transaction amount, which is computed from those count values. The transaction amounts are 5, 10 and 15 in ascending order, and their count values are 6, 2 and 2 respectively.
Table 3:
Transaction amount | Count value | Cumulative value | Total count value
5 | 6 | 6 | 10
10 | 2 | 8 | 10
15 | 2 | 10 | 10
In the transaction amount sort order shown in table 2, no transaction amount is arranged before the transaction amount 5, so the accumulated value corresponding to the transaction amount 5 is its own count value, 6. The transaction amount arranged before the transaction amount 10 is 5, so the accumulated value corresponding to the transaction amount 10 is 8 (6 + 2 = 8). The transaction amounts arranged before the transaction amount 15 are 5 and 10, so the accumulated value corresponding to the transaction amount 15 is 10 (6 + 2 + 2 = 10).
Further, table 3 may also include the total count value, obtained from the count values recorded in table 2 (6 + 2 + 2 = 10).
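Steps S210 and S220 can be sketched together as a sort followed by a running sum. The function name `build_cumulative_table` and the tuple layout are assumptions made for illustration; the input is written out of order on purpose to show the sorting step.

```python
from itertools import accumulate

# Count value table (values from table 2), deliberately unsorted.
count_table = {15: 2, 5: 6, 10: 2}

def build_cumulative_table(counts):
    """Return rows of (value, count, accumulated value, total count value)."""
    ordered = sorted(counts.items())  # step S210: ascending by value
    # Step S220: running sum of the count values in sorted order.
    running = accumulate(count for _, count in ordered)
    total = sum(counts.values())
    return [(value, count, cum, total)
            for (value, count), cum in zip(ordered, running)]

cumulative_table = build_cumulative_table(count_table)
for row in cumulative_table:
    print(row)
```

The last accumulated value always equals the total count value, which gives a cheap consistency check on the table.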
At step S230, a quantile interval in which each duplicate data is located is obtained according to the accumulated value and the count value of each duplicate data.
The quantile interval in which each duplicate data is located is the interval, bounded by two quantiles, within which that duplicate data (i.e., the one or more data of the same value) falls. The quantile interval includes a starting quantile and an ending quantile: the starting quantile is the left end point of the interval, and the ending quantile is the right end point of the interval.
Specifically, scanning the data recorded in the cumulative value table, as shown in table 3, can analyze the distribution of quantiles of one or more data recorded in the database table. Further, the starting quantile and the ending quantile of each repeated data can be calculated according to the counting value, the accumulated value and the total counting value of each repeated data. Further, a formula may be utilized to calculate a starting quantile and an ending quantile for obtaining each duplicate data. The IO that scans the accumulation table is also minimal.
Starting quantile = (accumulated value - count value) / total count value.
Ending quantile = accumulated value / total count value.
Further, according to the quantile interval where each piece of repeated data is located, data corresponding to any quantile can be calculated, and the any quantile can be any value. Specifically, a quantile interval in which the arbitrary quantile is located can be determined, and the repeated data corresponding to the arbitrary quantile can be determined according to the corresponding relationship between the quantile interval and the repeated data.
Through the method, the repeated data corresponding to any quantile can be determined, so that subsequent data mining can be performed on one or more data in the database table.
Taking table 3 as an example: the starting quantile of the transaction amount 5 is (6 - 6) / 10 = 0 and the ending quantile is 6 / 10 = 0.6, so the quantile interval in which the transaction amount 5 is located is (0, 0.6]. The starting quantile of the transaction amount 10 is (8 - 2) / 10 = 0.6 and the ending quantile is 8 / 10 = 0.8, so the quantile interval in which the transaction amount 10 is located is (0.6, 0.8]. The starting quantile of the transaction amount 15 is (10 - 2) / 10 = 0.8 and the ending quantile is 10 / 10 = 1, so the quantile interval in which the transaction amount 15 is located is (0.8, 1].
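The two formulas of step S230 translate directly into code. In this sketch the row layout (value, count value, accumulated value, total count value) and the name `quantile_intervals` are assumptions for illustration; the rows are the values of table 3.

```python
# Accumulated value table (values from table 3):
# (transaction amount, count value, accumulated value, total count value).
cumulative_table = [(5, 6, 6, 10), (10, 2, 8, 10), (15, 2, 10, 10)]

def quantile_intervals(rows):
    """Return (value, starting quantile, ending quantile) for each row."""
    intervals = []
    for value, count, cumulative, total in rows:
        start = (cumulative - count) / total  # left end point (exclusive)
        end = cumulative / total              # right end point (inclusive)
        intervals.append((value, start, end))
    return intervals

quantile_table = quantile_intervals(cumulative_table)
for row in quantile_table:
    print(row)
```

Each starting quantile equals the ending quantile of the previous row, so the intervals tile (0, 1] without gaps or overlaps.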
From this, a result table, i.e., a quantile distribution table, can be obtained as shown in Table 4. Therefore, the repeated data corresponding to any quantile can be determined, so that the speed and the efficiency of processing such as mining, searching and the like of one or more data can be further improved.
Table 4:
Transaction amount | Starting quantile | Ending quantile
5 | 0 | 0.6
10 | 0.6 | 0.8
15 | 0.8 | 1
Scanning the quantile distribution table, such as table 4, makes it possible to determine the repeated data of a data item corresponding to any given quantile α from the quantile interval in which α falls. The scan IO is still minimal. For example, the quantile interval in which the transaction amount 5 is located is (0, 0.6], that of the transaction amount 10 is (0.6, 0.8], and that of the transaction amount 15 is (0.8, 1]. When the repeated data corresponding to the quantile α = 0.75 is needed, table 4 shows that 0.75 lies in the quantile interval (0.6, 0.8], whose repeated data is 10. The repeated data corresponding to the quantile 0.75 is therefore 10, so the transaction amount at that quantile is 10.
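One possible way to perform this lookup is a binary search over the ending quantiles, sketched below. The use of `bisect_left` is a choice made for this illustration (the application only speaks of scanning the table), and the name `value_at_quantile` is hypothetical. The sketch assumes half-open intervals that exclude the left end point, so α must satisfy 0 < α ≤ 1.

```python
from bisect import bisect_left

# Quantile distribution table (values from table 4):
# (transaction amount, starting quantile, ending quantile).
quantile_table = [(5, 0.0, 0.6), (10, 0.6, 0.8), (15, 0.8, 1.0)]

def value_at_quantile(table, alpha):
    """Return the repeated data whose interval (start, end] contains alpha."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must satisfy 0 < alpha <= 1")
    ends = [end for _, _, end in table]
    # The first interval whose ending quantile is >= alpha contains alpha,
    # because each interval includes its right end point.
    index = bisect_left(ends, alpha)
    return table[index][0]

print(value_at_quantile(quantile_table, 0.75))
```

A boundary quantile such as 0.6 resolves to the transaction amount 5, matching the closed right end point of the interval (0, 0.6].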
As can be seen from step S130, in the present application, after the full table is scanned once in step S110 (an IO of 1), only the count value table, the cumulative value table, and the quantile distribution table need to be scanned. Because the information in these tables is compressed in a large proportion relative to the information in the database table, each of these scans costs far less than 1 IO, and the mean value, the maximum value, the minimum value, the mode, the quantiles, and the repeated data corresponding to a quantile can all be obtained.
The results obtained by the data processing method, such as the mean value, the maximum value, the minimum value, the mode, and the quantiles, can be used to analyze the distribution of the data and to support further data mining. For example, in large-scale electronic commerce, where massive commodity transaction information is stored in a database, analyzing the distribution of the transaction amounts in the transaction details (database tables) reveals the purchasing preferences of users, so that the user experience can be improved according to those preferences.
According to an aspect of the present application, a data processing apparatus is also provided. As shown in fig. 3, fig. 3 is a block diagram of a data processing apparatus 300 according to an embodiment of the present application.
In the apparatus 300, the scanning module 310 may be used to scan one or more data stored in a database table. The database table records one or more record items, where one record item represents one record and each row of the database table holds one record item. Each record item comprises one or more data items, and each column of the database table holds a different data item. Each data item stores the data corresponding to that data item in each record item, where the data is a numerical value. The specific implementation of the scanning module 310 can be seen in step S110.
The counting module 320 may be configured to count duplicate data in the one or more data based on the scan to determine a count value for each duplicate data. The counting module 320 may also be configured to: the same data stored by one of the one or more data items of the database table is counted. The specific implementation of the counting module 320 can be seen in step S120.
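The counting performed by this module can be sketched with `collections.Counter` from the Python standard library. The list `transaction_amounts` is a hypothetical column of ten record items chosen so that the counts match table 2; the order of the rows is an assumption, since only the resulting count values are given.

```python
from collections import Counter

# Hypothetical "transaction amount" column of a database table with ten
# record items; the row order is invented, the counts match table 2.
transaction_amounts = [5, 10, 5, 15, 5, 5, 10, 5, 15, 5]

# One full scan of the column: identical values are counted together.
count_table = Counter(transaction_amounts)

for amount in sorted(count_table):
    print(amount, count_table[amount])
```

After this single scan, the ten rows are represented by three (value, count value) pairs, which is the compression that later steps rely on.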
The calculation module 330 may be configured to calculate a result associated with analyzing the data distribution based on each repeated data and the count value of each repeated data. The results associated with analyzing the data distribution include at least the mean, maximum, minimum, quantile, and mode of the one or more data stored under one of the one or more data items of the database table. The calculation module 330 is configured to: while scanning each repeated data and the count value of each repeated data, calculate the sum of the one or more data and the total count value of the count values of the one or more repeated data, to obtain the mean of the one or more data; compare the sizes of the repeated data to obtain the maximum value and the minimum value of the one or more data; and compare the sizes of the count values of the repeated data to obtain the mode of the one or more data. The specific implementation of the calculation module 330 can be seen in step S130.
The calculation module 330 may further include: a sorting sub-module 331, an accumulation sub-module 332, and an obtaining sub-module 333.
The sorting submodule 331 may be configured to sort the one or more duplicate data and the count value of the one or more duplicate data according to a size of each duplicate data when scanning each duplicate data and the count value of each duplicate data. The specific implementation of the sorting sub-module 331 can be seen in step S210.
The accumulation submodule 332 may be configured to obtain an accumulated value of each duplicate data according to the count value of each duplicate data and the count values of all the duplicate data arranged in front of each duplicate data based on the order formed by the sorting; wherein the accumulated value is a sum of the count value of each of the duplicated data and the count values of all of the duplicated data arranged in front of each of the duplicated data. The specific implementation of the accumulation sub-module can be seen in step S220.
The obtaining sub-module 333 may be configured to obtain a quantile interval where each piece of repeated data is located according to the accumulated value and the count value of each piece of repeated data. A specific implementation of the obtaining sub-module may be seen in step S230.
The determining module 340 may be configured to determine the repeated data corresponding to any quantile according to a quantile interval in which the any quantile is located, where any quantile may be any value between 0 and 1.
Since the specific embodiments of the various modules included in the apparatus of the present application described in fig. 3 correspond to the specific embodiments of the steps in the method of the present application, the specific details of the various modules will not be described here in order not to obscure the present application, since fig. 1-2 have already been described in detail.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer-readable medium does not include a transitory computer-readable medium such as a modulated data signal or a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

CN201310597967.8A | 2013-11-22 | 2013-11-22 | Data processing method and device | Pending | CN104657388A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201310597967.8A (CN104657388A (en)) | 2013-11-22 | 2013-11-22 | Data processing method and device

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201310597967.8A (CN104657388A (en)) | 2013-11-22 | 2013-11-22 | Data processing method and device

Publications (1)

Publication Number | Publication Date
CN104657388A | 2015-05-27

Family

ID=53248532

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201310597967.8A (Pending, CN104657388A (en)) | Data processing method and device | 2013-11-22 | 2013-11-22

Country Status (1)

Country | Link
CN (1) | CN104657388A (en)

Legal Events

Code | Title | Description
C06 | Publication |
PB01 | Publication |
C10 | Entry into substantive examination |
SE01 | Entry into force of request for substantive examination |
TA01 | Transfer of patent application right | Effective date of registration: 20191205. Address after: P.O. Box 31119, grand exhibition hall, hibiscus street, 802 West Bay Road, Grand Cayman, Cayman Islands. Applicant after: Innovative advanced technology Co., Ltd. Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands. Applicant before: Alibaba Group Holding Co., Ltd.
RJ01 | Rejection of invention patent application after publication | Application publication date: 20150527
