Detailed Description
The main idea of the present application is that, when a computer system processes massive data, the number of times each identical data value is repeated in a database table is counted to obtain a count value table, and results related to data distribution analysis (exploration) are then obtained from the count value table. Further, the total count value, mean, maximum, minimum and mode of the one or more data in the database table, and the cumulative value of each data, may be calculated from the count value of each data in the count value table, and the quantile interval in which each data is located may be obtained from the total count value and the cumulative value of each data. The distribution of the massive data can then be analyzed for subsequent data mining and the like. For details, reference may be made to the following description of the data processing method of the present application.
In operation, the present application first builds the count value table and then derives the results related to data distribution analysis from that table. Because the count value table effectively compresses the information in the database table, computing those results requires less computer IO, which improves calculation efficiency. The approach also avoids the low processing efficiency caused by the heavy computation of sorting the full data set, which reduces the amount of computation and lightens the processing burden on the computer system. In addition, processing time and the overhead of traversing the data are reduced, which improves the performance of the computer.
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
FIG. 1 is a flowchart of a data processing method according to an embodiment of the present application.
At step S110, one or more data stored in a database table is scanned.
The data may be various data stored in the storage device. For a large amount of data with certain relationships, database tables can be used to store the data in an organized manner.
In a relational database, a database table may be a two-dimensional array used to represent and store relationships between data objects. It consists of vertical columns and horizontal rows. For example, in a database table recording student achievement information, each student's achievement information can be a record item (row); the student's name, class, subject, score and the like are data items (columns); and the corresponding data (numerical value) is recorded at each row-column intersection.
Specifically, the computer may pre-record the associated one or more data into a database table (e.g., Table 1). In one embodiment, the database table (Table 1) may include one or more record items (entries), where a record item refers to one record (row) in the database table; the "serial number" column of Table 1 shows ten records, rows 1-10. Each record item may include one or more data items, such as the columns "serial number", "user ID" and "user gender" of Table 1; that is, a data item indicates a particular type of information, and in the example of Table 1 each record item includes data items such as a serial number and a user ID. Each data item may have data stored therein; for example, in the first record of Table 1, the "user ID" data item holds the data "123" and the "serial number" data item holds the data "1". The data stored in a data item may be a numerical value. It should be noted that, for data to be operated on in a computer system, character-type data may also be represented by a numerical value (binary value); for example, the data "male" in the "user gender" data item of Table 1 may be stored in the numerical form "00". Thus, one or more data (values) are recorded in the database table in an organized manner.
The computer may read each entry in the database table so that the data stored in each data item of each entry may be read. Further, the database table may be stored in a peripheral device, and the computer may perform data processing by reading the data of the database table from the peripheral device into memory. In one embodiment, before the computer system can process (e.g., operate on) the data, it must first obtain the data stored in the database table. Obtaining, i.e., accessing, the database table to read out its data may take the form of a database table scan, such as a full table scan. Scanning refers to the process of examining each entry in a database table until the data in all entries that meet a given condition has been returned. A scan can search every record of the whole table without using an index and thereby obtain all qualifying data.
Scanning the whole database table once to obtain the data counts as one unit of IO, i.e., IO is 1.
The following description takes as an example an application such as electronic commerce, which has a large data volume and requires arithmetic processing in a computer system. The very large volume of user transaction data in e-commerce is stored in a database in an organized manner and can be recorded in a database table. As shown in Table 1, each piece of a user's transaction information is a record item. The user ID, user gender, transaction ID, commodity ID, transaction amount and the like in the transaction information may be used as the data items of the record item. The data stored in each data item is the data that the computer system needs to process. Taking the numerical values stored in data items such as the transaction amount or a score as an example, operating on these values allows the data distribution to be analyzed, for instance to determine the numerical range in which the transaction amounts or scores are concentrated, in preparation for further processing such as data mining or searching.
Table 1:
| Serial number | User ID | User gender | Transaction ID | Commodity ID | Transaction amount |
| 1 | 123 | Male | 1001 | 001 | 5 |
| 2 | 124 | Male | 1002 | 003 | 15 |
| 3 | 125 | Female | 1003 | 005 | 15 |
| 4 | 126 | Male | 1004 | 007 | 10 |
| 5 | 127 | Female | 1005 | 009 | 5 |
| 6 | 128 | Female | 1006 | 011 | 5 |
| 7 | 129 | Female | 1007 | 013 | 5 |
| 8 | 130 | Female | 1008 | 015 | 10 |
| 9 | 131 | Male | 1009 | 017 | 5 |
| 10 | 133 | Female | 1010 | 019 | 5 |
Table 1 is only an example; the database table is not limited to its contents. It includes the data items serial number, user ID, user gender, transaction ID, commodity ID and transaction amount. For serial numbers 1-10, the row corresponding to each serial number is a record item, and each record item is one transaction record of a user. Each record item includes the aforementioned data items, and each data item stores data relating to the user's transaction. The data may be numerical values.
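As an illustration only, Table 1 can be modeled in memory as a list of record items, and the full table scan of step S110 as a single pass over them. The names `COLUMNS`, `ROWS`, `TABLE_1` and `scan`, and the "M"/"F" encoding of the gender column, are assumptions made for this sketch and are not part of the application.

```python
from typing import Callable, Iterator

# Hypothetical in-memory copy of Table 1: one tuple per record item (row).
COLUMNS = ("serial", "user_id", "gender", "transaction_id", "commodity_id", "amount")
ROWS = [
    (1, 123, "M", 1001, "001", 5),
    (2, 124, "M", 1002, "003", 15),
    (3, 125, "F", 1003, "005", 15),
    (4, 126, "M", 1004, "007", 10),
    (5, 127, "F", 1005, "009", 5),
    (6, 128, "F", 1006, "011", 5),
    (7, 129, "F", 1007, "013", 5),
    (8, 130, "F", 1008, "015", 10),
    (9, 131, "M", 1009, "017", 5),
    (10, 133, "F", 1010, "019", 5),
]
TABLE_1 = [dict(zip(COLUMNS, row)) for row in ROWS]


def scan(table: list[dict],
         predicate: Callable[[dict], bool] = lambda record: True) -> Iterator[dict]:
    """Full table scan: visit every record item once, without an index,
    and yield the record items that satisfy the given condition."""
    for record in table:
        if predicate(record):
            yield record


# Read the "transaction amount" data item of every record item in one pass.
amounts = [record["amount"] for record in scan(TABLE_1)]
print(amounts)  # [5, 15, 15, 10, 5, 5, 5, 10, 5, 5]
```

Passing a predicate, for example `lambda r: r["amount"] == 5`, returns only the record items that meet that condition, which mirrors the description of scanning given above.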
At step S120, the duplicate data in the one or more data is counted to determine a count value for each duplicate data. Further, a count value table can be generated.
The counting of the duplicated data is performed during the full-table scanning process of the database table, i.e. while scanning one or more data stored in the database table (step S110), the duplicated data is counted and the count values are recorded.
Duplicate data refers to identical data or data that appears repeatedly, the identical data or data that appears repeatedly having the same value (numerical value). The count value is the number of repetitions of repeated data in one or more data. In one or more entries, the same data stored in one or more data items are counted based on the same data item to obtain a count value of the same data. That is, there may be duplication of recorded data in a database table, i.e., data having the same value may be repeated one or more times in the database table.
Further, during the full table scan, the number of times each identical value appears in a given data item may be counted, and a corresponding count value table generated. Like Table 2, the count value table for a data item takes the form "value - count value", for example:
Data item A
Value 1, count 10;
Value 2, count 5;
……
Thus, by performing the counting and/or aggregation during the scan described in step S120, the database table can be compressed. When a database table holds a large amount of data, entries whose recorded values repeat may be merged. For example, if the scores of 100,000 examinees in a national examination are recorded in a database table, the table may be compressed by score (the data item) into a count value table. If scores are on a hundred-point scale, the count value table contains only a count value for each score from 0 to 100, about 100 records in total. That is, a database table of 100,000 record items is compressed into a count value table of about 100 record items, a compression ratio of about 1000:1, so the data amount is effectively reduced. This reduces the data volume for subsequent operations and lightens the calculation burden and processing pressure on the computer. Moreover, the counting and the generation of the count value table are completed during the full table scan and generate no additional IO, i.e., the additional IO is 0. Overall, the table is scanned only once (step S110, IO is 1).
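The compression effect in the examinee example can be sketched as follows. The score data here is synthetic and is assumed, purely for illustration, to take exactly 100 distinct values; the variable names are hypothetical.

```python
from collections import Counter

# Assumption for illustration: 100,000 synthetic scores taking 100 distinct
# values (1..100). Real examination data is not available in the source.
scores = [i % 100 + 1 for i in range(100_000)]

# The count value table is built in a single pass over the scores.
count_table = Counter(scores)

# Number of original record items per record in the count value table.
ratio = len(scores) / len(count_table)

print(len(count_table), ratio)  # 100 1000.0
```

Any later pass over `count_table` touches 100 records rather than 100,000, which is the sense in which subsequent scans are described as costing a small fraction of the original IO.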
In table 1, the data item "transaction amount" is used as a reference, and the transaction amount 15 appears twice during scanning. The number of repeated occurrences of each identical data may be counted during the scanning of one or more data recorded in the database table to obtain the number of repeated occurrences (count value) of each data, and thus obtain the count value table.
Table 2 shows a count table of the count values of the transaction amounts in table 1.
Table 2:
| Transaction amount | Count value |
| 5 | 6 |
| 10 | 2 |
| 15 | 2 |
That is, across the record items of Table 1, the "transaction amount" data item holds the value 5 six times, the value 10 twice, and the value 15 twice. Since every record item has a transaction amount data item, Table 1 contains 6 entries with a transaction amount of 5, 2 entries with a transaction amount of 10, and 2 entries with a transaction amount of 15.
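A minimal sketch of step S120, assuming the transaction amounts of Table 1 have been read in row order, is given here. The function name `build_count_table` is hypothetical; `collections.Counter` is used only as a convenient counting structure and is not prescribed by the application.

```python
from collections import Counter

# Transaction amounts from Table 1, in row order (serial numbers 1-10).
amounts = [5, 15, 15, 10, 5, 5, 5, 10, 5, 5]


def build_count_table(values):
    """Count how many times each value repeats, in one pass and without sorting."""
    counts = Counter()
    for value in values:      # stands in for the full table scan of step S110
        counts[value] += 1    # counting happens while scanning (step S120)
    return counts


count_table = build_count_table(amounts)
for value in sorted(count_table):
    print(value, count_table[value])
# 5 6
# 10 2
# 15 2
```

The printed rows correspond to the transaction amount and count value pairs described for Table 2.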
At step S130, a result related to analyzing the data distribution is calculated from each of the duplicated data and the count value of each of the duplicated data.
The duplicate data may be further processed based on the count value of each duplicate data, so as to obtain a series of results (parameters), and the distribution of one or more data recorded in the database table may be reflected according to the series of results (parameters). For example, the series of results (parameters) may reflect the peak of the data distribution in one or more data sets.
The results (parameters) associated with analyzing the data distribution may include the mean, maximum, minimum, quantile, mode, and the like. The mean is the average of the one or more data. The maximum is the largest of the one or more data. The minimum is the smallest of the one or more data. The mode is the data that is repeated the greatest number of times among the one or more data. The quantile α is a value in the interval (0, 1). A quantile α corresponds to one of the one or more data, such that roughly a proportion α of the one or more data is no larger than the data corresponding to the quantile α.
By scanning the count value table generated in step S120, the duplicate data recorded in it and the corresponding count values may be processed to calculate the mean, maximum, minimum and mode of the one or more data recorded in the database table. During the same scan, the total number of record items in the whole table (i.e., the number of records) and the total value (the sum of the data in a given data item) can also be calculated. At this point the IO of the scan has been greatly reduced. In the examinee score example of step S120, the IO of scanning the count value table is only about 0.001 of that of scanning the original database table.
While scanning each duplicate data and its count value, the computer system may calculate the sum of the one or more data recorded by a data item in the full table and the total count value of the count values of the one or more duplicate data, thereby obtaining the mean of the one or more data; compare the sizes of the duplicate data to obtain the maximum and minimum of the one or more data; and compare the count values of the duplicate data to obtain the mode of the one or more data.
Specifically, by scanning the count value table, the total number of record items in the database table (i.e., the total number of records) can be determined and the sum of the one or more recorded data can be calculated, so that the mean of the data recorded in the database table is obtained: mean = sum of data / total number of records. Here, the total number of records is the sum of the count values of the respective duplicate data, i.e., the total count value; the sum of the data is obtained by multiplying each duplicate data by its count value and summing the products. The maximum and minimum can be obtained by pairwise comparison of the duplicate data in the count value table; they are the maximum and minimum of the one or more data in the database table. The duplicate data with the largest count value in the count value table is the mode of the one or more data in the database table.
Further, using the mean obtained by the above calculation, the duplicate data and count values recorded in the count value table can be scanned again to obtain the variance and standard deviation of the data recorded in the database table. Because this pass again scans only the compressed count value table, its IO is extremely small, e.g., about 0.001 in the examinee score example above.
Taking the count value table shown in Table 2 as an example, scanning it shows that the total number of entries in the database table is 10, i.e., the sum of the count values (6 + 2 + 2 = 10), and that the total transaction amount is 80, i.e., the sum of the products of each transaction amount and its count value (5 × 6 + 10 × 2 + 15 × 2 = 80), so the mean transaction amount is 8 (80 / 10 = 8). Comparing the transaction amounts 5, 10 and 15 pairwise gives the maximum 15 and the minimum 5. Because the mode is the data with the largest count value, and the largest count value in Table 2 is 6, belonging to the transaction amount 5, the mode is 5.
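The second pass over the count value table can be sketched as below, using the transaction amount counts of Table 2. The source does not say whether the population or the sample variance is intended; this sketch assumes the population variance (division by the total count value), and the variable names are hypothetical.

```python
import math

# Count value table for the transaction amounts (Table 2): value -> count value.
count_table = {5: 6, 10: 2, 15: 2}

# First pass: total count value and mean.
total_count = sum(count_table.values())
mean = sum(value * count for value, count in count_table.items()) / total_count

# Second pass over the same compressed table: weight each squared deviation
# by the count value. Assumption: population variance, not sample variance.
variance = sum(count * (value - mean) ** 2
               for value, count in count_table.items()) / total_count
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)  # 8.0 16.0 4.0
```

Both passes iterate over three records rather than the ten record items of Table 1, which is why the extra pass adds very little IO.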
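The single scan of the count value table described above can be sketched as follows. The function name `summarize` and the returned dictionary keys are hypothetical, and the sketch assumes a non-empty count value table; when several values share the largest count value, it simply reports the first one encountered.

```python
# Count value table for the transaction amounts (Table 2): value -> count value.
count_table = {5: 6, 10: 2, 15: 2}


def summarize(count_table):
    """One pass over the count value table (assumed non-empty)."""
    total_count = 0      # total number of record items
    total_sum = 0        # sum of the data in the data item
    minimum = maximum = mode = None
    mode_count = 0
    for value, count in count_table.items():
        total_count += count
        total_sum += value * count
        if minimum is None or value < minimum:
            minimum = value
        if maximum is None or value > maximum:
            maximum = value
        if count > mode_count:   # ties keep the first value seen
            mode, mode_count = value, count
    return {
        "total_count": total_count,
        "total_sum": total_sum,
        "mean": total_sum / total_count,
        "max": maximum,
        "min": minimum,
        "mode": mode,
    }


stats = summarize(count_table)
print(stats)
# {'total_count': 10, 'total_sum': 80, 'mean': 8.0, 'max': 15, 'min': 5, 'mode': 5}
```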
The process of scanning the data recorded in the count value table and the count value corresponding to the data, calculating the quantile to generate a quantile distribution table, and further determining the data or the repeated data in a certain data item corresponding to any given quantile is shown in fig. 2. FIG. 2 is a flow chart of the steps of calculating quantiles according to an embodiment of the present application.
At step S210, when scanning each duplicate data and the count value of each duplicate data, the one or more duplicate data and the count value of the one or more duplicate data may also be sorted according to the size of each duplicate data. For example, in the process of scanning the count value table to calculate the number of entries in step S130, the repeated data in the count value table may be sorted.
One or more repeated data and a count value corresponding to each repeated data are contained in the count value table. The one or more duplicate data is ordered according to the size of the one or more duplicate data. The one or more data are sorted, for example, in an ascending manner.
As shown in table 2, the transaction amount 5 is smaller than the transaction amount 10, the transaction amount 10 is smaller than the transaction amount 15, and the transaction amounts and the count values of the transaction amounts are sorted according to the transaction amounts.
Here, the count value indicates how many entries hold a given value of the transaction amount, i.e., how many transactions were for that amount. That is, there are 6 transactions with an amount of 5, 2 with an amount of 10, and 2 with an amount of 15.
At step S220, based on the sorting order, a cumulative value of each duplicate data is obtained according to the count value of each duplicate data and the count values of all the duplicate data arranged in front of each duplicate data. Thereby obtaining an accumulated value table. Wherein the accumulated value is a sum of the count value of each of the duplicated data and the count values of all of the duplicated data arranged in front of each of the duplicated data.
For example: as shown in table 3, a table of accumulated values obtained based on the transaction amount and the count value recorded in table 2. The accumulated value table shown in table 3 includes the transaction amounts and the count values in table 2, and the accumulated value corresponding to each transaction amount is obtained based on the count values in table 2. The transaction amounts are 5, 10 and 15 in ascending order, and the respective count values are 6, 2 and 2 in sequence.
Table 3:
| Transaction amount | Count value | Cumulative value | Total count value |
| 5 | 6 | 6 | 10 |
| 10 | 2 | 8 | 10 |
| 15 | 2 | 10 | 10 |
In the ascending order of transaction amounts, no transaction amount is arranged before the transaction amount 5, so the cumulative value corresponding to the transaction amount 5 is its own count value, 6. The transaction amount arranged before the transaction amount 10 is 5, so the cumulative value corresponding to the transaction amount 10 is 8 (6 + 2 = 8). The transaction amounts arranged before the transaction amount 15 are 5 and 10, so the cumulative value corresponding to the transaction amount 15 is 10 (6 + 2 + 2 = 10).
Further, Table 3 may also include the total count value obtained from the respective count values recorded in Table 2 (6 + 2 + 2 = 10).
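Steps S210 and S220 can be sketched together: sort the distinct values, then take a running sum of their count values. The tuple layout of `cumulative_table` is an assumption made for this illustration, and the input is deliberately given out of order to show that only the small count value table is sorted, not the original data.

```python
from itertools import accumulate

# Count value table (value -> count value), deliberately unordered.
count_table = {15: 2, 5: 6, 10: 2}

# Step S210: sort the distinct values in ascending order.
values = sorted(count_table)
counts = [count_table[value] for value in values]

# Step S220: cumulative value = own count value + count values of all earlier values.
cumulative = list(accumulate(counts))
total_count = cumulative[-1]   # the last cumulative value is the total count value

# Rows of (transaction amount, count value, cumulative value, total count value).
cumulative_table = [
    (value, count, cum, total_count)
    for value, count, cum in zip(values, counts, cumulative)
]
print(cumulative_table)
# [(5, 6, 6, 10), (10, 2, 8, 10), (15, 2, 10, 10)]
```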
At step S230, a quantile interval in which each duplicate data is located is obtained according to the accumulated value and the count value of each duplicate data.
The quantile interval in which each duplicate data is located is the interval, bounded by two quantiles, within which that duplicate data (i.e., the one or more data sharing that value) falls. The quantile interval has a starting quantile, which is the left end point of the interval, and an ending quantile, which is the right end point.
Specifically, scanning the data recorded in the cumulative value table, as shown in Table 3, makes it possible to analyze the quantile distribution of the one or more data recorded in the database table. The IO of scanning the cumulative value table is likewise minimal. The starting quantile and the ending quantile of each duplicate data can be calculated from its count value, its cumulative value and the total count value using the following formulas:
Starting quantile = (cumulative value - count value) / total count value.
Ending quantile = cumulative value / total count value.
Further, from the quantile interval in which each duplicate data is located, the data corresponding to any quantile can be determined, where the quantile may be any value between 0 and 1. Specifically, the quantile interval containing the given quantile is determined, and the duplicate data corresponding to that quantile is then determined from the correspondence between quantile intervals and duplicate data.
Through the method, the repeated data corresponding to any quantile can be determined, so that subsequent data mining can be performed on one or more data in the database table.
Taking Table 3 as an example: the starting quantile of the transaction amount 5 is (6 - 6) / 10 = 0 and the ending quantile is 6 / 10 = 0.6, so the quantile interval in which the transaction amount 5 is located is (0, 0.6]. The starting quantile of the transaction amount 10 is (8 - 2) / 10 = 0.6 and the ending quantile is 8 / 10 = 0.8, so its quantile interval is (0.6, 0.8]. The starting quantile of the transaction amount 15 is (10 - 2) / 10 = 0.8 and the ending quantile is 10 / 10 = 1, so its quantile interval is (0.8, 1].
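The two quantile formulas can be applied to the cumulative values as in the sketch below. The function name `quantile_intervals` and the row layout are hypothetical; each returned row holds the value together with its starting and ending quantile, the interval being open on the left and closed on the right.

```python
# Rows of (transaction amount, count value, cumulative value), in ascending order.
cumulative_rows = [(5, 6, 6), (10, 2, 8), (15, 2, 10)]
total_count = 10


def quantile_intervals(rows, total):
    """Return (value, starting quantile, ending quantile) for each row.

    starting quantile = (cumulative value - count value) / total count value
    ending quantile   = cumulative value / total count value
    """
    return [
        (value, (cumulative - count) / total, cumulative / total)
        for value, count, cumulative in rows
    ]


quantile_table = quantile_intervals(cumulative_rows, total_count)
print(quantile_table)
# [(5, 0.0, 0.6), (10, 0.6, 0.8), (15, 0.8, 1.0)]
```

Each ending quantile equals the starting quantile of the next row, so the intervals cover (0, 1] without gaps or overlap.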
From this, a result table, i.e., a quantile distribution table, can be obtained as shown in Table 4. Therefore, the repeated data corresponding to any quantile can be determined, so that the speed and the efficiency of processing such as mining, searching and the like of one or more data can be further improved.
Table 4:
| Transaction amount | Starting quantile | Ending quantile |
| 5 | 0 | 0.6 |
| 10 | 0.6 | 0.8 |
| 15 | 0.8 | 1 |
Scanning the quantile distribution table, such as Table 4, allows the duplicate data of a data item corresponding to any given quantile α to be determined from the interval in which α falls. The IO of this scan is still minimal. For example, the quantile interval of the transaction amount 5 is (0, 0.6], that of the transaction amount 10 is (0.6, 0.8], and that of the transaction amount 15 is (0.8, 1]. When the duplicate data corresponding to the quantile α = 0.75 is needed, Table 4 shows that 0.75 lies in the quantile interval (0.6, 0.8], whose duplicate data is 10; the duplicate data corresponding to the quantile 0.75 is therefore 10, i.e., the transaction amount at that quantile is 10.
As can be seen from step S130, after the full table is scanned once in step S110 (IO is 1), the present application only needs to scan the count value table, the cumulative value table and the quantile distribution table. Because the information in these tables is compressed by a large ratio relative to the database table, the IO of each such scan is much smaller than 1. The operation is thus completed, yielding the mean, maximum, minimum, mode, quantiles, and the duplicate data corresponding to any quantile.
The results obtained by the data processing method, such as the mean, maximum, minimum, mode and quantiles, can be used to analyze the distribution of the data and to support further data mining. For example, in large-scale electronic commerce, where massive commodity transaction information is stored in a database, analyzing the distribution of the transaction amounts in the transaction details (database tables) can reveal users' purchasing preferences, which can in turn be used to improve the user experience.
According to an aspect of the present application, a data processing apparatus is also provided. FIG. 3 is a block diagram of a data processing apparatus 300 according to an embodiment of the present application.
In the apparatus 300, the scanning module 310 may be used to scan one or more data stored in a database table. The database table records one or more record items, where one record item represents one record and each row of the database table holds one record item; each record item comprises one or more data items, and each column of the database table holds a different data item; each data item stores the data corresponding to that data item in each record item, where the data is a numerical value. The specific implementation of the scanning module 310 can be seen in step S110.
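Looking up the duplicate data for a given quantile can be sketched as a search over the ending quantiles of the quantile distribution table. The function name `value_at_quantile` is hypothetical, and the sketch assumes intervals of the form (start, end] and a quantile α with 0 < α <= 1; a binary search via `bisect_left` is used here for brevity, although a plain linear scan of the table would serve equally well.

```python
from bisect import bisect_left

# Quantile distribution table: (transaction amount, starting quantile, ending quantile).
quantile_table = [(5, 0.0, 0.6), (10, 0.6, 0.8), (15, 0.8, 1.0)]


def value_at_quantile(table, alpha):
    """Return the value whose interval (start, end] contains alpha.

    Assumption: 0 < alpha <= 1 and the table is sorted by ending quantile.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must satisfy 0 < alpha <= 1")
    ends = [end for _, _, end in table]
    # bisect_left returns the first row whose ending quantile is >= alpha,
    # which is exactly the row with start < alpha <= end.
    return table[bisect_left(ends, alpha)][0]


print(value_at_quantile(quantile_table, 0.75))  # 10
```

A quantile that coincides with an interval boundary, such as 0.6, resolves to the lower value (5 here) because the intervals are closed on the right.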
The counting module 320 may be configured to count duplicate data in the one or more data based on the scan to determine a count value for each duplicate data. The counting module 320 may also be configured to: the same data stored by one of the one or more data items of the database table is counted. The specific implementation of the counting module 320 can be seen in step S120.
The calculation module 330 may be configured to calculate a result associated with analyzing the data distribution based on each of the duplicate data and the count value of each of the duplicate data. Results associated with analyzing the data distribution include at least the mean, maximum, minimum, quantile and mode of the one or more data stored in one of the one or more data items of the database table. The calculation module 330 is configured to: while scanning each duplicate data and its count value, calculate the sum of the one or more data and the total count value of the count values of the one or more duplicate data to obtain the mean of the one or more data; compare the sizes of the duplicate data to obtain the maximum and minimum of the one or more data; and compare the count values of the duplicate data to obtain the mode of the one or more data. The specific implementation of the calculation module 330 can be seen in step S130.
The calculation module 330 may further include: a sorting sub-module 331, an accumulation sub-module 332, and an obtaining sub-module 333.
The sorting submodule 331 may be configured to sort the one or more duplicate data and the count value of the one or more duplicate data according to a size of each duplicate data when scanning each duplicate data and the count value of each duplicate data. The specific implementation of the sorting sub-module 331 can be seen in step S210.
The accumulation submodule 332 may be configured to obtain an accumulated value of each duplicate data according to the count value of each duplicate data and the count values of all the duplicate data arranged in front of each duplicate data based on the order formed by the sorting; wherein the accumulated value is a sum of the count value of each of the duplicated data and the count values of all of the duplicated data arranged in front of each of the duplicated data. The specific implementation of the accumulation sub-module can be seen in step S220.
The obtaining sub-module 333 may be configured to obtain a quantile interval where each piece of repeated data is located according to the accumulated value and the count value of each piece of repeated data. A specific implementation of the obtaining sub-module may be seen in step S230.
The determining module 340 may be configured to determine the repeated data corresponding to any quantile according to a quantile interval in which the any quantile is located, where any quantile may be any value between 0 and 1.
Since the specific embodiments of the modules of the apparatus described with reference to FIG. 3 correspond to the specific embodiments of the steps of the method, which have been described in detail with reference to FIGS. 1-2, the details of the modules are not repeated here so as not to obscure the present application.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.