CN112051965B - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN112051965B
Authority
CN
China
Prior art keywords
data
columns
processed
data table
storage space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910495929.9A
Other languages
Chinese (zh)
Other versions
CN112051965A (en
Inventor
姜喆
高阳
俞飞江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201910495929.9A
Publication of CN112051965A
Application granted
Publication of CN112051965B
Active
Anticipated expiration

Abstract

Translated from Chinese

本申请公开了一种数据处理方法及装置。其中所述方法,包括:获取待处理数据表;根据所述待处理数据表中的数据列对所述待处理数据表占用存储空间大小的影响因素,从所述待处理数据表中选取第一数量的数据列;根据所述第一数量的数据列中的目标数据块的数量,从所述待处理数据表中选取第二数量的数据列,其中,所述目标数据块与所述第一数量的数据列中的至少一个数据块互为相似数据块;在所述待处理数据表中调整所述第二数量的数据列之间的排列顺序,获得新数据表,所述新数据表所占用的存储空间小于所述待处理数据表所占用的存储空间。采用本申请提供的方法,减小了数据表的存储空间。

The present application discloses a data processing method and device. The method includes: obtaining a data table to be processed; selecting a first number of data columns from the data table to be processed based on the factors affecting the storage space occupied by the data columns in the data table to be processed; selecting a second number of data columns from the data table to be processed based on the number of target data blocks in the first number of data columns, wherein the target data block and at least one data block in the first number of data columns are similar data blocks; and adjusting the arrangement order of the second number of data columns in the data table to be processed to obtain a new data table, wherein the storage space occupied by the new data table is less than the storage space occupied by the data table to be processed. Using the method provided by the present application, the storage space of the data table is reduced.

Description

Data processing method and device
Technical Field
The present application relates to the field of data processing, and in particular, to a data processing method and apparatus.
Background
With the rapid development of data services, data service platforms need to store more and more data. Storing these data requires a huge amount of storage media, and the overhead of using these storage media is very large.
In the prior art, data tables stored in a storage medium are arranged in the chronological order of data generation. In order to reduce the storage space of the data table, the positions of the data columns are generally kept unchanged before being saved to the storage medium, and compression processing is performed only for the data of each data column.
However, with this data processing method, even if there is similar data between data columns that are far apart, the similar data cannot be uniformly compressed, resulting in an insufficient effect of reducing the storage space of the data table after processing.
Disclosure of Invention
The application provides a data processing method for reducing the storage space of a data table.
The method comprises the following steps:
acquiring a data table to be processed;
Selecting a first number of data columns from the data table to be processed according to the influence factors of the data columns in the data table to be processed on the size of the storage space occupied by the data table to be processed;
Selecting a second number of data columns from the data table to be processed according to the number of target data blocks in the first number of data columns, wherein the target data blocks and at least one data block in the first number of data columns are similar data blocks;
and adjusting the arrangement sequence among the second number of data columns in the data table to be processed to obtain a new data table, wherein the storage space occupied by the new data table is smaller than the storage space occupied by the data table to be processed.
Optionally, the obtaining the data table to be processed includes:
acquiring the size of a storage space occupied by a data table stored in a nonvolatile storage medium;
And acquiring the data table to be processed, which needs to compress the storage space, from the nonvolatile storage medium according to the size of the storage space occupied by the data table.
Optionally, the factors affecting the size of the storage space occupied by the data column in the data table to be processed include average field length of the data column in the data table to be processed;
The selecting a first number of data columns from the data table to be processed according to the influence factors of the data columns in the data table to be processed on the size of the storage space occupied by the data table to be processed includes:
and selecting a first number of data columns with average field length larger than a specified field length threshold from the data table to be processed.
Optionally, the data processing method further includes:
Obtaining a target data block in the first number of data columns;
the number of target data blocks in the first number of data columns is calculated.
Optionally, the obtaining the target data block in the first number of data columns includes:
performing word segmentation on the data blocks in the first number of data columns to obtain word segmentation results;
According to the word segmentation result, obtaining a feature vector of the data block;
Obtaining the distance between any two data blocks according to the feature vectors of the any two data blocks;
judging whether any two data blocks are similar data blocks or not according to the distance between any two data blocks;
and if the arbitrary two data blocks are similar data blocks, determining the arbitrary two data blocks as the target data block.
Optionally, the obtaining the feature vector of the data block according to the word segmentation result includes:
according to the word segmentation result, the number of words contained in the segmented data block is obtained;
Generating a word dictionary after word segmentation according to the number of words contained in the data block after word segmentation;
and obtaining the feature vector of the data block according to the word dictionary after word segmentation.
Optionally, the obtaining the distance between any two data blocks according to the feature vector of the any two data blocks includes:
According to the feature vectors of any two data blocks, obtaining the Euclidean distance between any two data blocks or the cosine distance between any two data blocks;
And obtaining the distance between any two data blocks according to the Euclidean distance between any two data blocks or the cosine distance between any two data blocks.
Optionally, the determining, according to the distance between any two data blocks, whether any two data blocks are similar data blocks, includes:
and if the distance between any two data blocks is smaller than the specified distance threshold value, determining that the any two data blocks are similar data blocks.
Optionally, in the new data table, any one of the second number of data columns is adjacent to the other one of the second number of data columns.
Optionally, the adjusting the arrangement sequence between the second number of data columns in the data table to be processed to obtain a new data table includes:
Setting different arrangement sequences among the second number of data columns in the data table to be processed to obtain a plurality of new data tables;
The method further comprises the steps of:
Selecting a new data table with the smallest occupied storage space from the plurality of new data tables;
and storing the new data table with the smallest occupied storage space into a storage medium.
Optionally, the data processing method further comprises judging whether the size of the storage space saved by the new data table with the smallest occupied storage space relative to the data table to be processed reaches or exceeds a storage space size threshold;
Storing the new data table with the minimum occupied storage space into a storage medium comprises storing the new data table with the minimum occupied storage space into the storage medium if the storage space saved by the new data table with the minimum occupied storage space relative to the data table to be processed is lower than a storage space size threshold value.
Optionally, the data processing method further comprises the step of maintaining the arrangement sequence among the data columns in the data table to be processed if the size of the storage space saved by the new data table with the smallest occupied storage space relative to the data table to be processed reaches or exceeds the storage space size threshold.
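The step of trying different arrangement orders and keeping the smallest resulting table can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the row-by-row comma-separated serialization, the use of zlib as the compressor, and the names compressed_size and best_column_order are all hypothetical, since the source does not name a storage format or compression algorithm.

```python
import itertools
import zlib


def compressed_size(header, rows, order):
    """Serialize the table row by row in the given column order and return its zlib size."""
    idx = [header.index(name) for name in order]
    lines = [",".join(order)]
    lines += [",".join(str(row[i]) for i in idx) for row in rows]
    return len(zlib.compress("\n".join(lines).encode("utf-8")))


def best_column_order(header, rows, selected):
    """Try every ordering of the selected columns and keep the smallest result.

    The selected columns are permuted among the slots they already occupy
    (an assumption); all other columns stay where they are.
    """
    slots = [i for i, name in enumerate(header) if name in selected]
    best_order, best_size = None, None
    for perm in itertools.permutations([header[i] for i in slots]):
        order = list(header)
        for slot, name in zip(slots, perm):
            order[slot] = name
        size = compressed_size(header, rows, order)
        if best_size is None or size < best_size:
            best_order, best_size = order, size
    return best_order, best_size


# Made-up sample table: "city" and "region" hold similar data but are not adjacent.
header = ["id", "city", "note", "region"]
rows = [[i, "Hangzhou", "order %d shipped" % i, "Hangzhou"] for i in range(50)]
order, size = best_column_order(header, rows, ["city", "note", "region"])
original = compressed_size(header, rows, header)
```

Because the original order is one of the candidates, the selected size is never larger than the original size. The number of candidate orders grows factorially with the number of selected columns, which is one reason to keep the second number of data columns small.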
The present application provides a data processing apparatus comprising:
the acquisition unit is used for acquiring the data table to be processed;
the first selecting unit is used for selecting a first number of data columns from the data table to be processed according to the influence factors of the data columns in the data table to be processed on the size of the storage space occupied by the data table to be processed;
a second selecting unit, configured to select a second number of data columns from the first number of data columns according to the number of target data blocks in the first number of data columns, where the target data blocks and at least one data block in the first number of data columns are similar data blocks;
The adjusting unit is used for adjusting the arrangement sequence among the second number of data columns in the data table to be processed to obtain a new data table, and the storage space occupied by the new data table is smaller than the storage space occupied by the data table to be processed.
Optionally, the acquiring unit is specifically configured to:
acquiring the size of a storage space occupied by a data table stored in a nonvolatile storage medium;
And acquiring the data table to be processed, which needs to compress the storage space, from the nonvolatile storage medium according to the size of the storage space occupied by the data table.
Optionally, the first selecting unit is specifically configured to:
and selecting a first number of data columns with average field length larger than a specified field length threshold from the data table to be processed.
Optionally, the data processing apparatus further includes a computing unit, where the computing unit is configured to:
Obtaining a target data block in the first number of data columns;
the number of target data blocks in the first number of data columns is calculated.
Optionally, the second selecting unit is specifically configured to:
performing word segmentation on the data blocks in the first number of data columns to obtain word segmentation results;
According to the word segmentation result, obtaining a feature vector of the data block;
Obtaining the distance between any two data blocks according to the feature vectors of the any two data blocks;
judging whether any two data blocks are similar data blocks or not according to the distance between any two data blocks;
and if the arbitrary two data blocks are similar data blocks, determining the arbitrary two data blocks as the target data block.
Optionally, the second selecting unit is further configured to:
according to the word segmentation result, the number of words contained in the segmented data block is obtained;
Generating a word dictionary after word segmentation according to the number of words contained in the data block after word segmentation;
and obtaining the feature vector of the data block according to the word dictionary after word segmentation.
Optionally, the second selecting unit is further configured to:
According to the feature vectors of any two data blocks, obtaining the Euclidean distance between any two data blocks or the cosine distance between any two data blocks;
And obtaining the distance between any two data blocks according to the Euclidean distance between any two data blocks or the cosine distance between any two data blocks.
Optionally, the second selecting unit is further configured to:
and if the distance between any two data blocks is smaller than the specified distance threshold value, determining that the any two data blocks are similar data blocks.
Optionally, in the new data table, any one of the second number of data columns is adjacent to the other one of the second number of data columns.
Optionally, the adjusting unit is specifically configured to:
Setting different arrangement sequences among the second number of data columns in the data table to be processed to obtain a plurality of new data tables;
the data processing apparatus further comprises a storage unit for:
Selecting a new data table with the smallest occupied storage space from the plurality of new data tables;
and storing the new data table with the smallest occupied storage space into a storage medium.
Optionally, the data processing device further includes a judging unit, configured to judge whether a size of a storage space saved by the new data table with the smallest occupied storage space relative to the data table to be processed reaches or exceeds a storage space size threshold;
The storage unit is used for storing the new data table with the minimum occupied storage space into a storage medium if the storage space saved by the new data table with the minimum occupied storage space relative to the data table to be processed is lower than a storage space size threshold value.
Optionally, the storage unit is further configured to maintain an arrangement order among the data columns in the data table to be processed if the size of the new data table with the smallest occupied storage space relative to the storage space saved by the data table to be processed reaches or exceeds the storage space size threshold.
The application provides a data processing method, which comprises the following steps:
acquiring a data table to be processed;
Selecting a first number of data columns from the data table to be processed according to the influence factors of the data columns in the data table to be processed on the size of the storage space occupied by the data table to be processed;
Selecting a second number of data columns from the data table to be processed according to the number of target data blocks in the first number of data columns, wherein the target data blocks and at least one data block in the first number of data columns are similar data blocks;
selecting a third number of data columns with time attribute from the first number of data columns;
and in the data table to be processed, aiming at the second number of data columns and the third number of data columns, adjusting the arrangement sequence among the data columns to obtain a new data table, wherein the storage space occupied by the new data table is smaller than the storage space occupied by the data table to be processed.
Optionally, the selecting a third number of data columns with a time attribute from the first number of data columns includes:
and selecting a third number of data columns with time attributes from the first number of data columns according to the specified time regular expression.
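Selecting data columns with a time attribute by regular expression can be sketched as follows. The pattern, the min_ratio parameter, and the function name select_time_columns are assumptions made for illustration; the source only states that a specified time regular expression is used.

```python
import re

# Hypothetical pattern for values such as "2019-06-10 08:30:00"; the source
# only says that a specified time regular expression is used.
TIME_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")


def select_time_columns(columns, pattern=TIME_PATTERN, min_ratio=1.0):
    """Return names of columns whose values match the time pattern.

    columns maps a column name to its list of string values. min_ratio is the
    fraction of values that must match (an assumed knob, not from the source).
    """
    selected = []
    for name, values in columns.items():
        if not values:
            continue
        hits = sum(1 for v in values if pattern.fullmatch(v))
        if hits / len(values) >= min_ratio:
            selected.append(name)
    return selected


# Made-up sample columns.
columns = {
    "gmt_create": ["2019-06-10 08:30:00", "2019-06-10 08:31:12"],
    "payload": ["order created", "order paid"],
}
time_columns = select_time_columns(columns)
```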
The present application provides a data processing apparatus comprising:
The first acquisition unit is used for acquiring a data table to be processed;
A third selecting unit, configured to select a first number of data columns from the to-be-processed data table according to an influence factor of the data columns in the to-be-processed data table on the size of the storage space occupied by the to-be-processed data table;
A fourth selecting unit, configured to select a second number of data columns from the to-be-processed data table according to the number of target data blocks in the first number of data columns, where the target data blocks and at least one data block in the first number of data columns are similar data blocks;
a fifth selecting unit, configured to select a third number of data columns with a time attribute from the first number of data columns;
The second adjusting unit is configured to adjust an arrangement sequence between the data columns in the to-be-processed data table according to the second number of data columns and the third number of data columns, so as to obtain a new data table, where a storage space occupied by the new data table is smaller than a storage space occupied by the to-be-processed data table.
Optionally, the fifth selecting unit is specifically configured to:
and selecting a third number of data columns with time attributes from the first number of data columns according to the specified time regular expression.
The present application provides an electronic device including:
a processor; and
a memory for storing a computer program, the device executing any one of the data processing methods described above after the computer program is executed by the processor.
The present application provides a computer storage medium storing a computer program for execution by a processor to perform any one of the data processing methods described above.
The application provides a data processing method, which comprises the following steps:
acquiring a data table A to be processed, wherein the A comprises a plurality of data columns, one data column comprises at least one data block, the at least one data column corresponds to an influence factor, and the influence factor represents an influence factor of one data column on the size of the storage space occupied by the A;
Selecting a plurality of data columns with influence factors meeting a first preset condition from the A;
Based on the similarity of the data blocks included in the plurality of data columns meeting the first preset condition, adjusting the sequence of the data columns with the similarity meeting the second preset condition, and generating a new data table;
And the new data table occupies a smaller storage space than the storage space occupied by the A during compression storage.
The application provides a data processing method, which comprises the following steps:
acquiring a data table A to be processed, wherein the A comprises a plurality of data columns, and one data column comprises at least one data block;
selecting a plurality of data columns meeting a third preset condition from the A;
Based on the similarity of the data blocks included in the plurality of data columns meeting the third preset condition, adjusting the sequence of the data columns with the similarity meeting the fourth preset condition, and generating a new data table;
And the new data table occupies a smaller storage space than the storage space occupied by the A during compression storage.
Compared with the prior art, the application has the following advantages:
The data processing method comprises the steps of obtaining a data table to be processed, selecting a first number of data columns from the data table to be processed according to the influence factors of data columns in the data table to be processed on the size of storage space occupied by the data table to be processed, selecting a second number of data columns from the data table to be processed according to the number of target data blocks in the first number of data columns, wherein the target data blocks and at least one data block in the first number of data columns are similar to each other, and adjusting the arrangement sequence of the second number of data columns in the data table to be processed to obtain a new data table, wherein the storage space occupied by the new data table is smaller than the storage space occupied by the data table to be processed. By adopting the method provided by the application, the data columns influencing the storage space are selected, whether the data columns influencing the storage size are similar or not is further analyzed, and the storage space of the data table can be obviously reduced by adjusting the arrangement sequence among the data columns.
Drawings
Fig. 1 is a schematic diagram of an application scenario embodiment of a data processing method provided by the present application;
FIG. 2 is a flow chart of a data processing method according to a first embodiment of the present application;
FIG. 3 is a schematic diagram of a data processing apparatus according to a second embodiment of the present application;
FIG. 4 is a flow chart of a data processing method according to a third embodiment of the present application;
FIG. 5 is a flowchart of the operation of an application system according to a third embodiment of the present application;
FIG. 6 is a schematic diagram of a data processing apparatus according to a fourth embodiment of the present application;
FIG. 7 is a flowchart of a data processing method according to a seventh embodiment of the present application;
Fig. 8 is a flowchart of a data processing method according to an eighth embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The present application may be embodied in many other forms than those herein described, and those skilled in the art will readily appreciate that the present application may be similarly embodied without departing from the spirit or essential characteristics thereof, and therefore the present application is not limited to the specific embodiments disclosed below.
To help those skilled in the art better understand the present application, a specific application scenario embodiment is described first. Fig. 1 is a schematic diagram of an application scenario of a data processing method according to an embodiment of the present application. In a specific implementation, a client sends request information for acquiring a data table to be processed to a database server. After acquiring the data table to be processed, the client selects a first number of data columns from the data table to be processed according to the influence factors of the data columns in the data table to be processed on the size of the storage space occupied by the data table to be processed. The client then selects a second number of data columns from the data table to be processed according to the number of target data blocks in the first number of data columns, wherein the target data block and at least one data block in the first number of data columns are similar data blocks. The client adjusts the arrangement order of the second number of data columns in the data table to be processed to obtain a new data table, the storage space occupied by the new data table being smaller than the storage space occupied by the data table to be processed. Finally, the client sends a request for storing the new data table to the database server.
The first embodiment of the application provides a data processing method. Referring to fig. 2, a flowchart of a first embodiment of the present application is shown. A data processing method according to a first embodiment of the present application is described in detail below with reference to fig. 1. The method comprises the following steps:
step S201, a data table to be processed is obtained.
The step is used for obtaining the data table to be processed.
A data table is a structured data file that may be used to store certain types of data. When data tables are stored in a nonvolatile storage medium, they are generally compressed in order to reduce storage space. For data tables with the same content, different data distributions yield different compression ratios when the table is stored to the nonvolatile storage medium. This is because the distribution of similar data blocks affects the compression ratio.
With the rapid rise of cloud computing, a large number of data tables have been migrated to the cloud. Generally, a cloud platform charges according to the storage amount of a client, so in order to reduce the storage overhead, it is an urgent need to reduce the storage space of a data table.
In cloud data platforms currently in common use in the industry, a data table is treated as a collection of data; that is, tables with the same content but different data distributions are equivalent to the cloud data platform. In actual storage, however, how well a compression algorithm performs differs greatly across data distributions. For example, the same data table occupies different amounts of storage space after being sorted by different columns.
In general, each data table has one or more fields that have a relatively large impact on storage space, and these fields are key factors that affect compression. Among the properties of a field, the average field length and the number of unique values are two critical reference values.
The obtaining the data table to be processed comprises the following steps:
acquiring the size of a storage space occupied by a data table stored in a nonvolatile storage medium;
And acquiring the data table to be processed, which needs to compress the storage space, from the nonvolatile storage medium according to the size of the storage space occupied by the data table.
To reduce a client's cloud storage space quickly, data tables that occupy a relatively large storage space need to be found as early as possible. First, the size of the storage space occupied by each data table stored in the nonvolatile storage medium in the cloud is acquired. The data tables are then sorted by the size of the storage space they occupy, and the data tables occupying the larger storage space are acquired from the nonvolatile storage medium as data tables to be processed.
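The ranking described above can be sketched as follows. The function name pick_tables_to_process, the top_k parameter, and the sample sizes are hypothetical; how the sizes are obtained from the storage medium is outside the sketch.

```python
def pick_tables_to_process(table_sizes, top_k=1):
    """Return the names of the top_k tables that occupy the most storage.

    table_sizes maps a table name to its size in bytes on the storage medium.
    """
    ranked = sorted(table_sizes, key=table_sizes.get, reverse=True)
    return ranked[:top_k]


# Made-up sizes in bytes.
sizes = {"orders": 7_500_000_000, "users": 120_000_000, "logs": 42_000_000_000}
to_process = pick_tables_to_process(sizes, top_k=2)
```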
Step S202, selecting a first number of data columns from the data table to be processed according to the influence factors of the data columns in the data table to be processed on the size of the storage space occupied by the data table to be processed.
The method comprises the step of selecting a first number of data columns from the data table to be processed according to the influence factors of the data columns in the data table to be processed on the size of the storage space occupied by the data table to be processed.
In this embodiment, selecting the first number of data columns may be selecting the three data columns that most affect the storage space of the data table.
The influence factors of the data columns in the data table to be processed on the size of the storage space occupied by the data table to be processed include the average field length of the data columns in the data table to be processed;
The selecting a first number of data columns from the data table to be processed according to the influence factors of the data columns in the data table to be processed on the size of the storage space occupied by the data table to be processed includes:
selecting a first number of data columns with average field length larger than a specified field length threshold from the data table to be processed.
For example, the field length threshold may be set to X bytes; Y data columns whose average field length is greater than X are then selected from the data table to be processed.
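A minimal sketch of the threshold selection follows. It assumes that field length is measured in bytes of the UTF-8 encoded value; the function names and the sample data are hypothetical.

```python
def average_field_length(values):
    """Average length in bytes of the UTF-8 encoded field values."""
    if not values:
        return 0.0
    return sum(len(str(v).encode("utf-8")) for v in values) / len(values)


def select_long_columns(columns, threshold_bytes):
    """Return names of columns whose average field length exceeds the threshold."""
    return [name for name, values in columns.items()
            if average_field_length(values) > threshold_bytes]


# Made-up sample columns.
columns = {
    "id": ["1", "2", "3"],
    "address": ["No. 1 West Lake Road", "No. 22 Wensan Road", "No. 3 Binjiang Avenue"],
}
first_columns = select_long_columns(columns, threshold_bytes=8)
```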
Step S203, selecting a second number of data columns from the data table to be processed according to the number of target data blocks in the first number of data columns, wherein the target data blocks and at least one data block in the first number of data columns are similar data blocks.
The step is used for selecting a second number of data columns from the data table to be processed according to the number of target data blocks in the first number of data columns, wherein the target data blocks and at least one data block in the first number of data columns are similar data blocks.
For example, the first number of data columns may be the three data columns that most affect the storage space of the data table. Denote the first number of data columns as {A, B, C} and the set of data columns outside the first number as {a, b, c, ..., n}. The following operation is performed on the target data blocks of column A:
SELECT
  COUNT(DISTINCT a) AS a_cnt,
  COUNT(DISTINCT b) AS b_cnt,
  COUNT(DISTINCT c) AS c_cnt,
  ...
  COUNT(DISTINCT n) AS n_cnt
FROM table t GROUP BY (target data blocks of column A);
The two columns with the smallest values among a_cnt, b_cnt, c_cnt, ..., n_cnt are selected; for example, {a, b} is the key set that has the greatest influence on column A.
The same operation is performed on the target data blocks of column B:
SELECT
  COUNT(DISTINCT a) AS a_cnt,
  COUNT(DISTINCT b) AS b_cnt,
  COUNT(DISTINCT c) AS c_cnt,
  ...
  COUNT(DISTINCT n) AS n_cnt
FROM table t GROUP BY (target data blocks of column B);
The two columns with the smallest values among a_cnt, b_cnt, c_cnt, ..., n_cnt are selected; for example, {a, c} is the key set that has the greatest influence on column B.
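One possible reading of the grouped distinct-count query is sketched below. The source does not say how the per-group counts are combined into a single value per column, so summing them over all groups is an assumption, as are the function name key_set_for and the sample rows.

```python
from collections import defaultdict


def key_set_for(rows, key, others, top=2):
    """Pick the `top` columns with the fewest distinct values per group of `key`.

    rows is a list of dicts. For every value of `key`, the distinct values of
    each other column are counted; the counts are summed over all groups
    (the aggregation is an assumption, the source does not spell it out).
    """
    distinct = {col: defaultdict(set) for col in others}
    for row in rows:
        for col in others:
            distinct[col][row[key]].add(row[col])
    totals = {col: sum(len(vals) for vals in groups.values())
              for col, groups in distinct.items()}
    ranked = sorted(others, key=lambda col: totals[col])
    return ranked[:top], totals


# Made-up sample rows: "A" is a column from the first number of data columns.
rows = [
    {"A": "x", "a": 1, "b": "p", "c": 10},
    {"A": "x", "a": 1, "b": "p", "c": 11},
    {"A": "y", "a": 2, "b": "p", "c": 12},
    {"A": "y", "a": 2, "b": "q", "c": 13},
]
key_set, totals = key_set_for(rows, "A", ["a", "b", "c"])
```

A low count means the column takes few different values within each group of the key column, so placing it next to the key column tends to put similar data close together.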
The data processing method further comprises the following steps:
Obtaining a target data block in the first number of data columns;
the number of target data blocks in the first number of data columns is calculated.
The following is a detailed description of the above steps.
The obtaining the target data block in the first number of data columns includes:
performing word segmentation on the data blocks in the first number of data columns to obtain word segmentation results;
According to the word segmentation result, obtaining a feature vector of the data block;
Obtaining the distance between any two data blocks according to the feature vectors of the any two data blocks;
judging whether any two data blocks are similar data blocks or not according to the distance between any two data blocks;
and if the arbitrary two data blocks are similar data blocks, determining the arbitrary two data blocks as the target data block.
First, the data blocks in the first number of data columns are segmented into words according to a specific delimiter (such as a space) to obtain word segmentation results. A word dictionary is then formed from all the words obtained after segmentation, with each word corresponding to one vector element. The feature vector of each data block in the first number of data columns is obtained through one-hot encoding, and the distance between any two data blocks is obtained from their feature vectors. Whether the two data blocks are similar data blocks is then judged according to that distance; for example, if the distance between the two data blocks is smaller than a preset distance threshold, the two data blocks are judged to be similar data blocks. Finally, if the two data blocks are similar data blocks, they are determined as target data blocks.
Obtaining the feature vector of a data block according to the word segmentation result includes:
obtaining, according to the word segmentation result, the number of words contained in the data blocks after word segmentation;
generating a post-segmentation word dictionary according to the number of words contained in the data blocks after word segmentation;
and obtaining the feature vector of the data block according to the post-segmentation word dictionary.
For example, there are three text blocks:
1. I love China;
2. Dad and mom love me;
3. Dad and mom love China.
After word segmentation, the following dictionary is generated:
1 I, 2 love, 3 dad, 4 mom, 5 China.
One-hot encoding is then used to extract the features of the three text blocks ("me" maps to the dictionary entry "I", and the connective "and" is not a dictionary entry). The resulting feature vectors are:
I love China: [1,1,0,0,1];
Dad and mom love me: [1,1,1,1,0];
Dad and mom love China: [0,1,1,1,1].
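The dictionary and feature-vector steps above can be sketched in Python as follows. This is only an illustrative sketch: the function names are hypothetical, the blocks are assumed to be segmented already (with "me" normalised to "I"), and the encoding is taken to be a binary presence vector with one element per dictionary word, which is what the example vectors suggest.

```python
def build_dictionary(blocks):
    """Return the distinct words of all blocks, in order of first appearance."""
    dictionary = []
    for tokens in blocks:
        for word in tokens:
            if word not in dictionary:
                dictionary.append(word)
    return dictionary


def feature_vector(tokens, dictionary):
    """One element per dictionary word: 1 if the word occurs in the block, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in dictionary]


# The three example blocks, already segmented (assumption: "me" becomes "I").
blocks = [
    ["I", "love", "China"],
    ["dad", "mom", "love", "I"],
    ["dad", "mom", "love", "China"],
]
# Dictionary order copied from the example in the text.
dictionary = ["I", "love", "dad", "mom", "China"]
vectors = [feature_vector(tokens, dictionary) for tokens in blocks]
```

With the dictionary order used in the text, `vectors` matches the three feature vectors listed above. `build_dictionary` orders words by first appearance, so it would place "China" third; the order only permutes the vector elements and does not change the distances between blocks.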
Obtaining the distance between any two data blocks according to their feature vectors includes:
obtaining, according to the feature vectors of the two data blocks, the Euclidean distance or the cosine distance between the two data blocks;
and obtaining the distance between the two data blocks according to that Euclidean distance or cosine distance.
For example, for two vectors X and Y, the cosine distance dist(X, Y) is calculated as:
The Euclidean distance sim(X, Y) is calculated as:
Using these formulas, the Euclidean distance or the cosine distance between any two data blocks can be obtained. Either the Euclidean distance or the cosine distance can be used as the distance between the two data blocks.
Since calculating the Euclidean distance and the cosine distance is a common calculation step, it is not illustrated further here.
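The formula images are not reproduced in this text. As a reference sketch only, the standard textbook definitions for two n-dimensional vectors X = (x_1, ..., x_n) and Y = (y_1, ..., y_n) are given below, keeping the names dist and sim used above. Taking the cosine distance to be one minus the cosine similarity is an assumption; the source may intend the cosine similarity itself.

```latex
\mathrm{dist}(X, Y) = 1 - \frac{\sum_{i=1}^{n} x_i y_i}
                               {\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}}
\qquad
\mathrm{sim}(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^{2}}
```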
Judging whether any two data blocks are similar data blocks according to the distance between them includes:
if the distance between the two data blocks is smaller than the specified distance threshold, determining that the two data blocks are similar data blocks.
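A minimal Python sketch of the distance and threshold test might look like this. The function names and the threshold value of 1.5 are illustrative assumptions, and the cosine distance is assumed to be one minus the cosine similarity.

```python
import math


def euclidean_distance(x, y):
    """Square root of the summed squared element-wise differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_distance(x, y):
    """1 - cosine similarity (assumed convention); undefined for all-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def similar_pairs(vectors, threshold, distance=euclidean_distance):
    """Index pairs (i, j), i < j, whose distance is below the threshold."""
    return [
        (i, j)
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
        if distance(vectors[i], vectors[j]) < threshold
    ]


# Feature vectors of the three example text blocks.
vectors = [[1, 1, 0, 0, 1], [1, 1, 1, 1, 0], [0, 1, 1, 1, 1]]
pairs = similar_pairs(vectors, threshold=1.5)  # threshold chosen for illustration
```

For the example vectors, the second and third blocks differ in two elements, so their Euclidean distance is the square root of 2 and they fall under the illustrative threshold; the first block is at distance square root of 3 from both others and does not. Comparing every pair is quadratic in the number of blocks, which is acceptable for a sketch but may need blocking or sampling on large columns.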
After the similar data blocks are found, statistics can be collected with the following SQL statement:
select count(col1), count(col2), ..., count(coln) from table group by similar data block;
where col1, col2, ..., coln are the data columns in the data table to be processed. After the statistics are completed, the 3 columns with the smallest statistic values can be taken as the second number of data columns.
In the new data table, each of the second number of data columns is adjacent to another of the second number of data columns.
For example, the second number of data columns may be arranged together as one contiguous group.
Step S204, adjusting the arrangement order of the second number of data columns in the data table to be processed to obtain a new data table, where the storage space occupied by the new data table is smaller than that occupied by the data table to be processed.
This step adjusts the arrangement order of the second number of data columns in the data table to be processed to obtain a new data table that occupies less storage space than the data table to be processed.
Adjusting the arrangement order of the second number of data columns in the data table to be processed to obtain a new data table includes:
setting different arrangement orders for the second number of data columns in the data table to be processed to obtain a plurality of new data tables.
The method further includes:
selecting, from the plurality of new data tables, the new data table that occupies the least storage space;
and storing the new data table that occupies the least storage space into a storage medium.
For example, if the second number of data columns are the 3 data columns A, B, and C, adjusting their arrangement order yields six permutations, ABC, ACB, BAC, BCA, CAB, and CBA, which correspond to 6 new data tables. The new data table that occupies the least storage space is selected from these 6 new data tables and then stored into a storage medium.
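As a rough sketch of the selection described above, the Python below enumerates every arrangement of the chosen columns, measures each resulting table after compression, and keeps the smallest. The helper names, the tab-separated row layout, the sample data, and the use of zlib as a stand-in for the real compressed storage format are all assumptions made for illustration.

```python
import itertools
import zlib


def compressed_size(rows, column_order):
    """Size in bytes of the rows written in column_order and zlib-compressed."""
    text = "\n".join("\t".join(row[c] for c in column_order) for row in rows)
    return len(zlib.compress(text.encode("utf-8")))


def candidate_orders(columns):
    """All arrangements of the given columns (n! of them)."""
    return [list(p) for p in itertools.permutations(columns)]


def smallest_table(rows, columns):
    """Return (best_order, best_size) over every arrangement of the columns."""
    sizes = {tuple(o): compressed_size(rows, o) for o in candidate_orders(columns)}
    best = min(sizes, key=sizes.get)
    return list(best), sizes[best]


# Hypothetical sample data: each row maps a column name to a string value.
rows = [
    {"A": f"user-{i % 3}", "B": f"2019-06-{i % 28 + 1:02d}", "C": f"user-{i % 3}-session"}
    for i in range(200)
]
best_order, best_size = smallest_table(rows, ["A", "B", "C"])
```

Because the number of arrangements grows factorially, exhaustive enumeration like this is only practical when the second number of data columns is small, as in the three-column example.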
The method further includes judging whether the storage space saved by the new data table with the smallest occupied storage space, relative to the data table to be processed, reaches or exceeds a storage space size threshold.
Storing the new data table with the smallest occupied storage space into a storage medium then includes: storing it into the storage medium if the storage space it saves relative to the data table to be processed reaches or exceeds the storage space size threshold.
For example, if the storage space size threshold is 1.5G and the new data table with the smallest occupied storage space saves 1.6G relative to the data table to be processed, that new data table is stored into the storage medium.
The data processing method further includes maintaining the arrangement order of the data columns in the data table to be processed if the storage space saved by the new data table with the smallest occupied storage space, relative to the data table to be processed, is lower than the storage space size threshold.
For example, if the storage space size threshold is 1.5G and the new data table with the smallest occupied storage space saves only 1.4G, the arrangement order of the data columns in the data table to be processed is maintained.
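The threshold rule can be sketched as a small decision function. This assumes the threshold is compared against the space saved (original size minus new size), as the rule itself states; the function name, the return labels, and the 10G original size are hypothetical.

```python
def choose_table(original_size, best_new_size, saving_threshold):
    """Return "new" if the saved space reaches the threshold, else "original".

    All three arguments are sizes in the same unit (for example, gigabytes).
    """
    saved = original_size - best_new_size
    return "new" if saved >= saving_threshold else "original"


# Hypothetical 10G table with a 1.5G saving threshold.
decision_large_saving = choose_table(10.0, 8.4, 1.5)   # saves about 1.6G
decision_small_saving = choose_table(10.0, 8.6, 1.5)   # saves about 1.4G
```

In the first call the saving exceeds the threshold, so the reordered table would be stored; in the second it does not, so the original column order would be kept.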
In the above embodiment, a data processing method is provided, and correspondingly, the application also provides a data processing device. Referring to FIG. 3, a flowchart of an embodiment of a data processing apparatus according to the present application is shown. Since this embodiment, i.e. the second embodiment, is substantially similar to the method embodiment, the description is relatively simple, and reference should be made to the description of the method embodiment for relevant points. The device embodiments described below are merely illustrative.
The present application provides a data processing apparatus comprising:
an acquiring unit 301, configured to acquire a data table to be processed;
a first selecting unit 302, configured to select a first number of data columns from the to-be-processed data table according to an influence factor of the data columns in the to-be-processed data table on the size of the storage space occupied by the to-be-processed data table;
A second selecting unit 303, configured to select a second number of data columns from the to-be-processed data table according to the number of target data blocks in the first number of data columns, where the target data blocks and at least one data block in the first number of data columns are similar data blocks;
and the adjusting unit 304 is configured to adjust an arrangement order between the second number of data columns in the data table to be processed, so as to obtain a new data table, where a storage space occupied by the new data table is smaller than a storage space occupied by the data table to be processed.
In this embodiment, the acquiring unit is specifically configured to:
acquiring the size of a storage space occupied by a data table stored in a nonvolatile storage medium;
And acquiring the data table to be processed, which needs to compress the storage space, from the nonvolatile storage medium according to the size of the storage space occupied by the data table.
In this embodiment, the first selecting unit is specifically configured to:
and selecting a first number of data columns with average field length larger than a specified field length threshold from the data table to be processed.
In this embodiment, the data processing apparatus further includes a computing unit, where the computing unit is configured to:
Obtaining a target data block in the first number of data columns;
the number of target data blocks in the first number of data columns is calculated.
In this embodiment, the second selecting unit is specifically configured to:
performing word segmentation on the data blocks in the first number of data columns to obtain word segmentation results;
According to the word segmentation result, obtaining a feature vector of the data block;
Obtaining the distance between any two data blocks according to the feature vectors of the any two data blocks;
judging whether any two data blocks are similar data blocks or not according to the distance between any two data blocks;
and if the arbitrary two data blocks are similar data blocks, determining the arbitrary two data blocks as the target data block.
In this embodiment, the second selecting unit is further configured to:
according to the word segmentation result, the number of words contained in the segmented data block is obtained;
Generating a word dictionary after word segmentation according to the number of words contained in the data block after word segmentation;
and obtaining the feature vector of the data block according to the word dictionary after word segmentation.
In this embodiment, the second selecting unit is further configured to:
According to the feature vectors of any two data blocks, obtaining the Euclidean distance between any two data blocks or the cosine distance between any two data blocks;
And obtaining the distance between any two data blocks according to the Euclidean distance between any two data blocks or the cosine distance between any two data blocks.
In this embodiment, the second selecting unit is further configured to:
and if the distance between any two data blocks is smaller than the specified distance threshold value, determining that the any two data blocks are similar data blocks.
In this embodiment, in the new data table, each of the second number of data columns is adjacent to another of the second number of data columns.
In this embodiment, the adjusting unit is specifically configured to:
Setting different arrangement sequences among the second number of data columns in the data table to be processed to obtain a plurality of new data tables;
the data processing apparatus further comprises a storage unit for:
Selecting a new data table with the smallest occupied storage space from the plurality of new data tables;
and storing the new data table with the smallest occupied storage space into a storage medium.
In this embodiment, the data processing apparatus further includes a determining unit, configured to determine whether a size of a storage space saved by the new data table with the smallest occupied storage space with respect to the data table to be processed reaches or exceeds a storage space size threshold;
the storage unit is used for storing the new data table with the minimum occupied storage space into a storage medium if the size of the storage space saved by the new data table with the minimum occupied storage space relative to the data table to be processed reaches or exceeds a storage space size threshold.
In this embodiment, the storage unit is further configured to maintain an arrangement order among the data columns in the data table to be processed if a size of the storage space saved by the new data table with the smallest occupied storage space relative to the data table to be processed is lower than the storage space size threshold.
The third embodiment of the present application provides a data processing method. Because this embodiment largely overlaps with the first embodiment of the present application, the overlapping parts are not described again; refer to the relevant parts of the first embodiment. Please refer to FIG. 4, which is a flowchart of the data processing method according to this embodiment. The data processing method comprises the following steps:
step S401, a data table to be processed is obtained.
The step is used for obtaining the data table to be processed.
Step S402, selecting a first number of data columns from the data table to be processed according to the influence factors of the data columns in the data table to be processed on the size of the storage space occupied by the data table to be processed.
This step selects a first number of data columns from the data table to be processed according to the influence factors of the data columns in the data table to be processed on the size of the storage space occupied by the data table to be processed.
Step S403, selecting a second number of data columns from the to-be-processed data table according to the number of target data blocks in the first number of data columns, where the target data blocks and at least one data block in the first number of data columns are similar data blocks.
The step is used for selecting a second number of data columns from the data table to be processed according to the number of target data blocks in the first number of data columns, wherein the target data blocks and at least one data block in the first number of data columns are similar data blocks.
Step S404, selecting a third number of data columns with time attribute from the first number of data columns.
This step is used to select a third number of data columns having a temporal attribute from the first number of data columns.
The selecting a third number of data columns with time attribute from the first number of data columns includes:
and selecting a third number of data columns with time attributes from the first number of data columns according to the specified time regular expression.
A specific example of implementing a temporal regular expression using the JAVA language is given below.
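The Java listing referred to above is not reproduced in this text. As a stand-in, the following Python sketch shows how a time regular expression could be used to pick out time-valued columns. The pattern, the `min_ratio` cut-off, the function name, and the sample column names are all hypothetical; the actual expression used by the method is not specified here.

```python
import re

# Hypothetical pattern: dates such as 2019-06-06 or 2019/06/06, optionally
# followed by a time such as 12:30 or 12:30:00.
TIME_PATTERN = re.compile(
    r"\d{4}[-/]\d{1,2}[-/]\d{1,2}(?:[ T]\d{1,2}:\d{2}(?::\d{2})?)?"
)


def time_columns(table, min_ratio=0.9):
    """Names of columns where at least min_ratio of the values look like times.

    `table` maps a column name to its list of string values; min_ratio is an
    illustrative cut-off, not something the text specifies.
    """
    selected = []
    for name, values in table.items():
        if not values:
            continue
        hits = sum(1 for v in values if TIME_PATTERN.fullmatch(v.strip()))
        if hits / len(values) >= min_ratio:
            selected.append(name)
    return selected


# Hypothetical sample table.
table = {
    "gmt_create": ["2019-06-06 12:30:00", "2019/06/07"],
    "name": ["alice", "bob"],
    "note": ["2019-06-06 shipped", "ok"],
}
selected = time_columns(table)
```

Using `fullmatch` rather than `search` keeps free-text columns that merely mention a date, such as "note" in the sample, out of the selection.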
Step S405, in the data table to be processed, adjusting the arrangement order of the second number of data columns and the third number of data columns to obtain a new data table, where the storage space occupied by the new data table is smaller than the storage space occupied by the data table to be processed.
This step adjusts, in the data table to be processed, the arrangement order of the second number of data columns and the third number of data columns to obtain a new data table that occupies less storage space than the data table to be processed.
For example, if the second number of data columns are columns A, B, and C and the third number of data columns are columns D and E, adjusting the arrangement order of these data columns gives 5! (5 factorial, that is, 120) possible arrangements, such as ABCDE, ABCED, and ABDCE. Each arrangement corresponds to a new data table.
Fig. 5 provides a workflow diagram of an application system employing the data processing method provided in this embodiment. The working steps of the application system comprise:
Step S501, selecting a data table in which a memory space needs to be reduced.
Step S502, selecting a top data column influencing the storage space size of the data table through data analysis exploration.
Step S503, automatically analyzing the attribute of the top data column related to the data column in step S402 through an algorithm module, wherein the method specifically comprises the following sub-steps:
Step S503-1, finding out the similar text block in the top column through a similar text block algorithm. Specific implementation steps reference is made to the first embodiment of the application.
Step S503-2, aggregating the similar text blocks by executing the following SQL statement:
select count(col1), count(col2), ..., count(coln) from table group by similar text block;
where col1, col2, ..., coln represent the top data columns.
Step S503-3, finding the columns with the smallest statistic values as set A.
Step S503-4, selecting the columns that represent time as set B by means of the time regular expression.
Step S504, traversing the combinations of the columns in A and B, redistributing the data (distribute by a1, ..., an sort by a1, ..., an, bn), and comparing the storage space of the redistributed data table with that of the original data table to obtain a before-and-after comparison result.
Step S505, selecting the combination with the best effect; if storage is significantly reduced (for example, by more than 20%), the data is reordered by those data columns to reduce the storage size; otherwise, the data table is not suited to having its storage reduced in this way.
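Steps S504 and S505 can be approximated in a few lines of Python: sort the rows by a candidate set of columns, compare the compressed size before and after, and accept the reordering only if the saving exceeds the 20% figure given as an example. Here in-memory sorting stands in for the `distribute by ... sort by ...` redistribution and zlib stands in for the real storage codec; both, along with the function names and sample rows, are assumptions.

```python
import zlib


def stored_size(rows):
    """Compressed size in bytes of tab-separated rows; zlib is a stand-in codec."""
    text = "\n".join("\t".join(row) for row in rows)
    return len(zlib.compress(text.encode("utf-8")))


def evaluate_sort_keys(rows, key_indexes, min_saving=0.20):
    """Sort rows by the given column indexes and check the storage saving.

    Returns (sorted_rows, saving_ratio, accept), where accept is True only if
    the relative saving is greater than min_saving.
    """
    before = stored_size(rows)
    sorted_rows = sorted(rows, key=lambda row: tuple(row[i] for i in key_indexes))
    after = stored_size(sorted_rows)
    saving = (before - after) / before
    return sorted_rows, saving, saving > min_saving


# Hypothetical rows: (long text column, date column, id column).
rows = [
    (f"long repeated message body number {i % 5}", f"2019-06-{(i * 7) % 28 + 1:02d}", str(i))
    for i in range(300)
]
sorted_rows, saving, accept = evaluate_sort_keys(rows, [0, 1])
```

Running this for each candidate combination of columns from sets A and B and keeping the one with the largest `saving` corresponds to choosing the combination with the best effect; if no combination is accepted, the table is left as it is.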
In the above embodiment, a data processing method is provided, and correspondingly, the application also provides a data processing device. Referring to FIG. 6, a flowchart of an embodiment of a data processing apparatus according to the present application is shown. Since this embodiment, the fourth embodiment, is substantially similar to the method embodiment, the description is relatively simple, and reference will be made to the partial explanation of the method embodiment for the relevant points. The device embodiments described below are merely illustrative.
The present application provides a data processing apparatus comprising:
a first obtaining unit 601, configured to obtain a data table to be processed;
a third selecting unit 602, configured to select a first number of data columns from the to-be-processed data table according to an influence factor of the data columns in the to-be-processed data table on the size of the storage space occupied by the to-be-processed data table;
A fourth selecting unit 603, configured to select a second number of data columns from the to-be-processed data table according to the number of target data blocks in the first number of data columns, where the target data blocks and at least one data block in the first number of data columns are similar data blocks;
A fifth selecting unit 604, configured to select a third number of data columns with a time attribute from the first number of data columns;
And a second adjusting unit 605, configured to adjust, in the to-be-processed data table, an arrangement sequence between the data columns according to the second number of data columns and the third number of data columns, so as to obtain a new data table, where a storage space occupied by the new data table is smaller than a storage space occupied by the to-be-processed data table.
Optionally, the fifth selecting unit is specifically configured to:
and selecting a third number of data columns with time attributes from the first number of data columns according to the specified time regular expression.
A fifth embodiment of the present application provides an electronic apparatus including:
a processor; and
a memory for storing a computer program, where, after the computer program is run by the processor, the device executes any one of the data processing methods according to the first and third embodiments of the present application.
A sixth embodiment of the present application provides a computer storage medium storing a computer program that is executed by a processor to perform any one of the data processing methods provided in the first and third embodiments of the present application.
Referring to fig. 7, a flowchart of a data processing method according to a seventh embodiment of the present application is shown. Since this embodiment is similar to the first and third embodiments of the present application, only a brief description will be given here. The method comprises the following steps:
Step S701, a data table A to be processed is obtained, where A comprises a plurality of data columns, each data column comprises at least one data block, at least one data column corresponds to an influence factor, and the influence factor represents the influence of a data column on the size of the storage space occupied by A.
This step obtains a data table A to be processed, where A comprises a plurality of data columns, each data column comprises at least one data block, at least one data column corresponds to an influence factor, and the influence factor represents the influence of the data column on the size of the storage space occupied by A.
The influence factor of the one data column on the size of the storage space occupied by the A comprises the average field length of the data column.
Step S702, selecting a plurality of data columns with the influence factors meeting a first preset condition from the A.
The step is used for selecting a plurality of data columns with the influence factors meeting a first preset condition from the A.
For example, the first three of the data columns that most affect the storage space of the data table are selected.
Step S703, based on the similarity of the data blocks included in the plurality of data columns meeting the first preset condition, adjusting the order of the data columns whose similarity meets a second preset condition to generate a new data table, where, in compressed storage, the new data table occupies less storage space than A.
For example, a cosine distance or a euclidean distance between data blocks included in a plurality of data columns is calculated, and the similarity of the data blocks included in the plurality of data columns is obtained. And adjusting the sequence of the data columns with the similarity meeting the second preset condition, and generating a new data table.
An eighth embodiment of the present application provides a data processing method. Please refer to FIG. 8, which is a flowchart of the data processing method according to the eighth embodiment of the present application. Since this embodiment is similar to the first and third embodiments of the present application, only a brief description is given here. The data processing method comprises the following steps:
step S801, a data table A to be processed is obtained, wherein the A comprises a plurality of data columns, and one data column comprises at least one data block.
The method comprises the steps of obtaining a data table A to be processed, wherein the data table A comprises a plurality of data columns, and one data column comprises at least one data block.
Step S802, selecting a plurality of data columns meeting a third preset condition from the A.
The step is used for selecting a plurality of data columns meeting a third preset condition from the A.
Step S803, based on the similarity of the data blocks included in the plurality of data columns meeting the third preset condition, adjusting the order of the data columns whose similarity meets a fourth preset condition to generate a new data table, where, in compressed storage, the new data table occupies less storage space than A.
This step adjusts the order of the data columns whose similarity meets the fourth preset condition, based on the similarity of the data blocks included in the plurality of data columns, to generate a new data table that, in compressed storage, occupies less storage space than A.
While the application has been described in terms of preferred embodiments, it is not intended to be limiting, but rather, it will be apparent to those skilled in the art that various changes and modifications can be made herein without departing from the spirit and scope of the application as defined by the appended claims.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include non-permanent memory in computer-readable media, in the form of random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
1. Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
2. It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (16)

CN201910495929.9A, priority date 2019-06-06, filing date 2019-06-06: Data processing method and device. Status: Active. Publication: CN112051965B (en).

Priority Applications (1)

Application number: CN201910495929.9A
Publication: CN112051965B (en)
Priority date: 2019-06-06
Filing date: 2019-06-06
Title: Data processing method and device

Applications Claiming Priority (1)

Application number: CN201910495929.9A
Publication: CN112051965B (en)
Priority date: 2019-06-06
Filing date: 2019-06-06
Title: Data processing method and device

Publications (2)

CN112051965A, published 2020-12-08
CN112051965B, published 2025-09-02

Family

ID=73609739

Family Applications (1)

Application number: CN201910495929.9A
Status: Active
Publication: CN112051965B (en)
Priority date: 2019-06-06
Filing date: 2019-06-06
Title: Data processing method and device

Country Status (1)

CN (1): CN112051965B (en)



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
