Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a data block placement method based on a heterogeneous Hadoop cluster environment, which comprises the following steps:
step 1, dividing the data blocks stored in the heterogeneous cluster environment into hot data blocks, medium-heat data blocks, normal data blocks and cold data blocks according to how frequently each data block is accessed, and classifying the data nodes in the heterogeneous cluster environment into performance classes according to their performance and a preset performance standard;
step 2, carrying out data block correlation analysis, and marking the data blocks with correlation in each classification of the data blocks;
step 3, executing a data block placement strategy, in which each data block is placed on a data node of the classification that matches its performance requirement, according to the classifications of the data blocks and of the data nodes;
step 4, when executing the data block placement strategy, judging whether the data node selected for the current data block already stores another data block correlated with the current data block, and if so, executing step 3 again within the same data node classification to select another data node for placement;
step 5, completing the placement of the current data block, and executing step 3 again until all the data blocks are placed.
In the data block placement method based on the heterogeneous Hadoop cluster environment, step 1 comprises:
step 11, obtaining, by using a log collection tool, the number M of read operations on each data block in the heterogeneous cluster environment within a specified period T, and obtaining the access frequency B_f of the current period according to the balance factor τ and the access frequency B_f(pre) of the previous period:
step 12, calculating an average access frequency B_f(avg) from the per-period access frequencies of the data block to measure the heat of the data block, and classifying the data blocks from hot data blocks to cold data blocks in descending order of heat.
In the data block placement method based on the heterogeneous Hadoop cluster environment, step 2 comprises:
step 21, performing correlation analysis using the covariance cov between data blocks, according to the access frequency of each data block in different periods obtained in step 1:
where n is the number of periods, i is the current period, X and Y represent the access frequencies of data blocks B1 and B2, respectively, in the current period, and $\bar{X}$ and $\bar{Y}$ represent the average access frequencies of data blocks B1 and B2, respectively, over the n periods;
step 22, judging whether the covariance cov is positive; if so, the access frequencies of the two data blocks vary in the same direction and data blocks B1 and B2 have access correlation; otherwise, data blocks B1 and B2 do not have access correlation.
The invention also provides a data block placement system based on the heterogeneous Hadoop cluster environment, which comprises the following modules:
module 1, which divides the data blocks stored in the heterogeneous cluster environment into hot data blocks, medium-heat data blocks, normal data blocks and cold data blocks according to the access frequency of the data blocks, and classifies the data nodes in the heterogeneous cluster environment into performance classes according to their performance and a preset performance standard;
module 2, which performs data block correlation analysis and marks the correlated data blocks within each classification of the data blocks;
module 3, which executes a data block placement strategy, in which each data block is placed on a data node of the classification that matches its performance requirement, according to the classifications of the data blocks and of the data nodes;
module 4, which, when the data block placement strategy is executed, judges whether the data node selected for the current data block already stores another data block correlated with the current data block, and if so, invokes module 3 again within the same data node classification to select another data node for placement;
module 5, which completes the placement of the current data block and invokes module 3 again until all the data blocks are placed.
In the data block placement system based on the heterogeneous Hadoop cluster environment, module 1 comprises:
module 11, which obtains, by using a log collection tool, the number M of read operations on each data block in the heterogeneous cluster environment within a specified period T, and obtains the access frequency B_f of the current period according to the balance factor τ and the access frequency B_f(pre) of the previous period:
module 12, which calculates an average access frequency B_f(avg) from the per-period access frequencies of the data block to measure the heat of the data block, and classifies the data blocks from hot data blocks to cold data blocks in descending order of heat.
In the data block placement system based on the heterogeneous Hadoop cluster environment, module 2 comprises:
module 21, which performs correlation analysis using the covariance cov between data blocks, according to the access frequency of each data block in different periods obtained in module 1:
where n is the number of periods, i is the current period, X and Y represent the access frequencies of data blocks B1 and B2, respectively, in the current period, and $\bar{X}$ and $\bar{Y}$ represent the average access frequencies of data blocks B1 and B2, respectively, over the n periods;
module 22, which judges whether the covariance cov is positive; if so, the access frequencies of the two data blocks vary in the same direction and data blocks B1 and B2 have access correlation; otherwise, data blocks B1 and B2 do not have access correlation.
According to the scheme, the invention has the advantages that:
The invention provides a data block replica strategy for heterogeneous Hadoop cluster environments. The heat of each data block is measured by calculating its access frequency in each period, and data blocks are then placed on different data nodes according to their heat. Data block correlation is taken into account during placement: correlated data blocks are placed on separate data nodes rather than stored together on one node, which avoids several data blocks on one data node being accessed at the same time and reduces the load on that data node. The placement strategy provided by the invention improves the execution performance of the cluster and the utilization of its resources. Fig. 1 details the overall flow of the present invention, and Fig. 2 details the flow of the data block placement strategy.
Detailed Description
The invention aims to provide a data block placement strategy for hot and cold data in existing heterogeneous Hadoop clusters, so as to improve the execution performance and the resource utilization of the cluster.
Specifically, the present invention comprises the steps of:
A. Judging the heat of the data blocks. This is implemented as follows:
A1. Calculating the access frequency of each data block;
A1-1. The number of read operations on each data block in HDFS within a specified period T is collected by the Flume log collection tool and recorded as M. Because the access frequency may differ greatly from one period to the next, a balance factor τ is set; the access frequency of the previous period is recorded as B_f(pre), and the access frequency B_f of the current period is calculated as follows:
A1-2. The access frequency B_f(i) of the data block in the i-th period can be derived from formula (1) in step A1-1, where B_f(0) represents the access frequency when the data block is created; because a newly created data block has no access history from a previous period, B_f(0) is 0. The calculation formula of B_f(i) is as follows:
A2. Calculating the average access frequency B_f(avg) from the per-period access frequencies of the data block obtained in step A1;
A3. Using B_f(avg) from step A2 to measure the heat of the data blocks, and dividing the data blocks, in descending order of heat, into hotspot data blocks, medium-heat data blocks, normal data blocks and cold data blocks.
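The formulas referred to in steps A1-1 and A1-2 are not reproduced in this text, so the following Python sketch of step A rests on stated assumptions rather than on the original formulas. It assumes an exponentially smoothed update, B_f(i) = τ·(M_i/T) + (1 − τ)·B_f(i − 1) with B_f(0) = 0, takes B_f(avg) as the arithmetic mean of the per-period values, and uses hypothetical heat thresholds; the function names and `HEAT_THRESHOLDS` are illustrative only.

```python
from statistics import mean


def access_frequencies(read_counts, period_t, tau):
    """Return [B_f(1), ..., B_f(n)] for one data block.

    ASSUMPTION: B_f(i) = tau * (M_i / T) + (1 - tau) * B_f(i - 1), B_f(0) = 0.
    The source's formula (1) is unavailable; this smoothing form is a stand-in.
    """
    freqs = []
    previous = 0.0  # B_f(0): a newly created block has no access history
    for m in read_counts:
        current = tau * (m / period_t) + (1 - tau) * previous
        freqs.append(current)
        previous = current
    return freqs


def average_frequency(freqs):
    """B_f(avg), assumed here to be the arithmetic mean over all periods."""
    return mean(freqs) if freqs else 0.0


# Hypothetical lower bounds for each heat class, checked from hottest to coldest.
HEAT_THRESHOLDS = (("hot", 10.0), ("medium", 5.0), ("normal", 1.0))


def classify_heat(avg_freq, thresholds=HEAT_THRESHOLDS):
    """Map B_f(avg) to one of the four heat classes of step A3."""
    for label, lower_bound in thresholds:
        if avg_freq >= lower_bound:
            return label
    return "cold"
```

With τ = 0.5 and T = 10, read counts of 10 and then 20 give per-period frequencies of 0.5 and 1.25 under this assumed update. In practice the thresholds would need to be tuned to the cluster's workload.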
B. The data block correlation analysis is realized by the following method:
B1. Data blocks having access correlation;
B1-1. Correlation here mainly refers to a certain degree of association between data blocks of a cluster, such as data block B1 and data block B2. In the first case, a user who accesses data block B1 also accesses data block B2. In the second case, when the access frequency of data block B1 increases or decreases over time, the access frequency of data block B2 changes linearly in the same direction. In either case, the invention regards data block B1 and data block B2 as correlated;
B2. a method of detecting correlation;
B2-1. According to the access frequency of each data block in different periods obtained in step A, correlation analysis is performed using covariance. For example, the correlation of data blocks B1 and B2 can be calculated with the following formula:
where n is the number of periods, i is the current period, X and Y represent the access frequencies of data blocks B1 and B2, respectively, and $\bar{X}$ and $\bar{Y}$ represent the average access frequencies of data blocks B1 and B2, respectively, over the n periods;
B2-2. If the calculated covariance cov is positive, the access frequencies of the two data blocks vary in the same direction; if cov is 0, the two data blocks show no linear correlation; if cov is negative, the two data blocks are negatively correlated. Negative correlation is not the focus of the present invention, which is concerned mainly with positive correlation, that is, the case where cov is positive;
B2-3. This detection method checks whether two data blocks are correlated, but more than two data blocks may be correlated with one another. Therefore, if data block B1 is found to be correlated with data block B2, and data block B2 (or B1) is found to be correlated with data block B3, then data blocks B1, B2 and B3 are all treated as correlated;
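The covariance formula itself is not reproduced in this text. Assuming it is the usual sample covariance over the n periods, it would read as follows; the n − 1 denominator is an assumption, and using n instead would not change the sign test of step B2-2.

```latex
\operatorname{cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} \left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right),
\qquad
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i,
\quad
\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i
```

Here X_i and Y_i denote the access frequencies of data blocks B1 and B2 in period i.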
B3. Creating and marking the correlation sets of data blocks;
B3-1. The data blocks have been classified by heat in step A (hotspot, medium-heat, normal and cold data blocks). The categories are traversed in turn and correlation is detected among the data blocks of each category; if two or more data blocks are correlated, a correlation set C = {B1, B2, …, Bn} is established, where n is the number of data blocks in the set, and each set is marked with the BID of its first data block;
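Steps B2 and B3 can be sketched in Python as follows. This is a minimal illustration, not the implementation of the invention: it assumes the sample covariance with an n − 1 denominator, treats correlation as transitive as described in step B2-3 by merging blocks with a union-find structure, and expects the caller to pass the blocks of one heat class at a time. The function names and the `freqs_by_bid` mapping are hypothetical.

```python
from itertools import combinations


def covariance(xs, ys):
    """Sample covariance of two equal-length series (n - 1 denominator, an assumption)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation_sets(freqs_by_bid):
    """Group the block IDs of ONE heat class into correlation sets.

    freqs_by_bid maps a block ID (BID) to its per-period access frequencies.
    Two blocks are linked when their covariance is positive; links are merged
    transitively. Returns {mark_bid: [bids...]}, each set marked by its first BID.
    """
    parent = {bid: bid for bid in freqs_by_bid}

    def find(bid):
        while parent[bid] != bid:
            parent[bid] = parent[parent[bid]]  # path halving
            bid = parent[bid]
        return bid

    for a, b in combinations(freqs_by_bid, 2):
        if covariance(freqs_by_bid[a], freqs_by_bid[b]) > 0:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for bid in freqs_by_bid:  # insertion order keeps the first BID first
        groups.setdefault(find(bid), []).append(bid)
    # Only groups of two or more blocks form a correlation set.
    return {members[0]: members for members in groups.values() if len(members) >= 2}
```

Comparing every pair is quadratic in the number of blocks per class, which is acceptable for a sketch; a production system would likely need to limit the candidate pairs.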
C. the data node classification is realized by the following steps:
C1. Hardware differences are mainly reflected in the CPU, disk I/O, network and memory. Memory differs mainly in capacity, with little difference in performance, and network transmission is not the focus of the present invention, so these two items are not considered; the classification standard of the present invention therefore focuses on the CPU and disk I/O;
C2. The data nodes are roughly classified into the following types: 1) machines with strong CPU and IO performance are called the MAX type; 2) machines with strong CPU performance and average IO performance are called the CPU type; 3) machines with strong IO performance and average CPU performance are called the IO type; 4) machines with average CPU and IO performance are called the CIM type; and 5) machines with weak CPU and IO performance are called the CIB type;
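The classification in step C2 can be sketched as a simple threshold rule. The source names the five classes but gives no numeric criteria, so the normalized scores, the `strong` and `weak` cut-offs, and the handling of mixed cases (for example, a strong CPU with weak IO) are all assumptions made for illustration.

```python
NODE_CLASSES = ("MAX", "CPU", "IO", "CIM", "CIB")


def classify_node(cpu_score, io_score, strong=0.75, weak=0.35):
    """Map normalized CPU and disk I/O scores in [0, 1] to a node class.

    ASSUMPTIONS: the scores and both cut-offs are hypothetical; the source
    only describes the classes qualitatively.
    """
    cpu_strong = cpu_score >= strong
    io_strong = io_score >= strong
    if cpu_strong and io_strong:
        return "MAX"
    if cpu_score < weak and io_score < weak:
        return "CIB"
    if cpu_strong:
        return "CPU"
    if io_strong:
        return "IO"
    return "CIM"


def build_node_queues(nodes):
    """Group node IDs by class; `nodes` maps node ID -> (cpu_score, io_score)."""
    queues = {name: [] for name in NODE_CLASSES}
    for node_id, (cpu_score, io_score) in nodes.items():
        queues[classify_node(cpu_score, io_score)].append(node_id)
    return queues
```

The scores could come from any benchmark the operator trusts; the sketch only requires that they be comparable across nodes.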
D. the data block placement strategy based on the heat degree is realized by the following steps:
D1. In step A, the data blocks are divided into hotspot data blocks, medium-heat data blocks, normal data blocks and cold data blocks, and four queues are generated according to these four classifications;
D1-1. Hotspot data block queue B(h) = {B1, B2, …, Bm}, where m is the number of data blocks; medium-heat data block queue B(m) = {B1, B2, …, Bj}, where j is the number of data blocks; normal data block queue B(n) = {B1, B2, …, Bk}, where k is the number of data blocks; cold data block queue B(c) = {B1, B2, …, Bn}, where n is the number of data blocks;
D2. Obtaining several groups of data node queues according to the classification in step C2;
D2-1. 1) MAX-class data node queue D(MAX) = {D1, D2, …, Dm}, where m is the number of data nodes; 2) CPU-class data node queue D(CPU) = {D1, D2, …, Dn}, where n is the number of data nodes; 3) IO-class data node queue D(IO) = {D1, D2, …, Dj}, where j is the number of data nodes; 4) CIM-class data node queue D(CIM) = {D1, D2, …, Dk}, where k is the number of data nodes; 5) CIB-class data node queue D(CIB) = {D1, D2, …, Dl}, where l is the number of data nodes;
D2-2. The data nodes in queue D(MAX) store only replicas of the data blocks in queue B(h). The data blocks in queues B(m) and B(n) can be stored on the data nodes in queues D(CPU), D(IO) and D(CIM): medium-heat data blocks are stored preferentially on IO-class data nodes (IO being the main performance support required by high-heat data), then on CPU-class data nodes, and finally on CIM-class data nodes; normal data blocks are stored preferentially on CIM-class data nodes, and CPU-class and IO-class data nodes are considered only when those data nodes are saturated. The cold data blocks in queue B(c) can be stored only on the data nodes in queue D(CIB). Fig. 2 illustrates the method in more detail;
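The preference rules of step D2-2 amount to a lookup from heat class to an ordered list of node classes. The sketch below makes two assumptions beyond the text: hotspot blocks are sent to MAX-class nodes only, and normal blocks fall back to CPU-class nodes before IO-class nodes (the source does not order that fallback). The `has_capacity` callback is a hypothetical stand-in for whatever saturation check the cluster provides.

```python
# Node classes to try, in order, for each heat class (step D2-2).
# ASSUMPTION: the CPU-before-IO fallback order for "normal" is not in the source.
NODE_PREFERENCE = {
    "hot": ["MAX"],
    "medium": ["IO", "CPU", "CIM"],
    "normal": ["CIM", "CPU", "IO"],
    "cold": ["CIB"],
}


def candidate_nodes(heat, node_queues, has_capacity):
    """Yield node IDs for a block of the given heat, most preferred first.

    node_queues maps a node class to its list of node IDs; has_capacity is a
    caller-supplied predicate that returns False for saturated nodes.
    """
    for node_class in NODE_PREFERENCE[heat]:
        for node_id in node_queues.get(node_class, []):
            if has_capacity(node_id):
                yield node_id
```

Because saturated nodes are skipped, a normal block reaches CPU-class or IO-class nodes only after every CIM-class node has been rejected.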
D3. The placement strategy takes data correlation into account;
D3-1. Before a replica is placed, it is checked whether correlated data blocks exist. Storing two or more correlated data blocks on one node should be avoided, because when a user accesses one of them the node may then have to serve accesses to several data blocks at once; correlated data blocks should therefore be placed on separate nodes;
D3-2. The correlation sets are obtained from step B. It is first queried, by the BID of the data block, whether a corresponding correlation set exists; if not, this step is skipped; if so, the data blocks correlated with this data block are recorded and the current node is marked, and the marked node is not considered when those correlated data blocks are placed;
D4. In view of data locality and network transmission, most replicas of a data block are stored on the same rack; however, to preserve cluster availability, two of the replicas of a data block must be stored on different racks.
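Steps D3 and D4 can be illustrated with a small replica-selection routine. This is a simplified sketch under assumptions, not the procedure of Fig. 2: candidates are taken in the order supplied by the caller, nodes that already hold a correlated block are skipped, and the only rack rule enforced is that the chosen replicas span at least two racks. The function name and all parameters are hypothetical.

```python
def place_replicas(bid, replicas, candidates, rack_of, correlated_with, node_blocks):
    """Choose `replicas` nodes for block `bid` and record the placement.

    candidates: node IDs in preference order.
    rack_of: node ID -> rack ID.
    correlated_with: BID -> set of correlated BIDs (missing if uncorrelated).
    node_blocks: node ID -> set of BIDs stored there (updated in place).
    """
    related = correlated_with.get(bid, set())
    usable = []
    for node in candidates:
        stored = node_blocks.get(node, set())
        # D3: skip nodes that hold this block or any block correlated with it.
        if bid not in stored and not (related & stored):
            usable.append(node)
    if len(usable) < replicas:
        raise RuntimeError("not enough uncorrelated nodes for block " + str(bid))

    chosen = usable[:replicas]
    # D4 (simplified): with two or more replicas, use at least two racks.
    if replicas >= 2 and len({rack_of[n] for n in chosen}) < 2:
        first_rack = rack_of[chosen[0]]
        off_rack = next((n for n in usable[replicas:] if rack_of[n] != first_rack), None)
        if off_rack is None:
            raise RuntimeError("no candidate node on a second rack")
        chosen[-1] = off_rack

    for node in chosen:
        node_blocks.setdefault(node, set()).add(bid)
    return chosen
```

Raising an error when no suitable node exists is a design choice for the sketch; a real scheduler would more likely fall back to the next node class or relax the correlation constraint.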
In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
A. Judging the heat of the data blocks; B. analyzing the correlation of the data blocks; C. classifying the data nodes; D. a heat-based data block placement strategy. One specific implementation is as follows:
Specifically, the present invention comprises the following steps:
A. Judging the heat of the data blocks. This is implemented as follows:
A1. Calculating the access frequency of each data block;
A1-1. The number of read operations on each data block in HDFS within a specified period T is collected by the Flume log collection tool and recorded as M. Because the access frequency may differ greatly from one period to the next, a balance factor τ is set; the access frequency of the previous period is recorded as B_f(pre), and the access frequency B_f of the current period is calculated by the following formula:
A1-2. The access frequency B_f(i) of the data block in the i-th period can be derived from formula (1) in step A1-1, where B_f(0) represents the access frequency when the data block is created; because a newly created data block has no access history from a previous period, B_f(0) is 0. The calculation formula of B_f(i) is as follows:
A2. Calculating the average access frequency B_f(avg) from the per-period access frequencies of the data block obtained in step A1;
A3. Using B_f(avg) from step A2 to measure the heat of the data blocks, and dividing the data blocks, in descending order of heat, into hotspot data blocks, medium-heat data blocks, normal data blocks and cold data blocks.
B. The data block correlation analysis is realized by the following method:
B1. Data blocks having correlation;
B1-1. Correlation here refers to a certain degree of association between data blocks of a cluster, such as data block B1 and data block B2. In the first case, a user who accesses data block B1 also accesses data block B2. In the second case, when the access frequency of data block B1 increases or decreases over time, the access frequency of data block B2 changes linearly in the same direction. In either case, data block B1 and data block B2 are regarded as correlated;
B2. a method of detecting correlation;
B2-1. According to the access frequency of each data block in different periods obtained in step A, correlation analysis is performed using covariance. For example, the correlation of data blocks B1 and B2 can be calculated with the following formula:
where n is the number of periods, i is the current period, X and Y represent the access frequencies of data blocks B1 and B2, respectively, and $\bar{X}$ and $\bar{Y}$ represent the average access frequencies of data blocks B1 and B2, respectively, over the n periods;
B2-2. If the calculated covariance cov is positive, the access frequencies of the two data blocks vary in the same direction; if cov is 0, the two data blocks show no linear correlation; if cov is negative, the two data blocks are negatively correlated. Negative correlation is not the focus of the present invention, which is concerned mainly with positive correlation, that is, the case where cov is positive;
B2-3. This detection method checks whether two data blocks are correlated, but more than two data blocks may be correlated with one another. Therefore, if data block B1 is found to be correlated with data block B2, and data block B2 (or B1) is found to be correlated with data block B3, then data blocks B1, B2 and B3 are all treated as correlated;
B3. Creating and marking the correlation sets of data blocks;
B3-1. The data blocks have been classified by heat in step A (hotspot, medium-heat, normal and cold data blocks). The categories are traversed in turn and correlation is detected among the data blocks of each category; if two or more data blocks are correlated, a correlation set C = {B1, B2, …, Bn} is established, where n is the number of data blocks in the set, and each set is marked with the BID of its first data block;
C. the data node classification is realized by the following steps:
C1. Hardware differences are mainly reflected in the CPU, disk I/O, network and memory. Memory differs mainly in capacity, with little difference in performance, and network transmission is not the focus of the present invention, so these two items are not considered; the classification standard of the present invention therefore focuses on the CPU and disk I/O;
C2. The data nodes are roughly classified into the following types: 1) machines with strong CPU and IO performance are called the MAX type; 2) machines with strong CPU performance and average IO performance are called the CPU type; 3) machines with strong IO performance and average CPU performance are called the IO type; 4) machines with average CPU and IO performance are called the CIM type; and 5) machines with weak CPU and IO performance are called the CIB type;
D. the data block placement strategy based on the heat degree is realized by the following steps:
D1. In step A, the data blocks are divided into hotspot data blocks, medium-heat data blocks, normal data blocks and cold data blocks, and four queues are generated according to these four classifications;
D1-1. Hotspot data block queue B(h) = {B1, B2, …, Bm}, where m is the number of data blocks; medium-heat data block queue B(m) = {B1, B2, …, Bj}, where j is the number of data blocks; normal data block queue B(n) = {B1, B2, …, Bk}, where k is the number of data blocks; cold data block queue B(c) = {B1, B2, …, Bn}, where n is the number of data blocks;
D2. Obtaining several groups of data node queues according to the classification in step C2;
D2-1. 1) MAX-class data node queue D(MAX) = {D1, D2, …, Dm}, where m is the number of data nodes; 2) CPU-class data node queue D(CPU) = {D1, D2, …, Dn}, where n is the number of data nodes; 3) IO-class data node queue D(IO) = {D1, D2, …, Dj}, where j is the number of data nodes; 4) CIM-class data node queue D(CIM) = {D1, D2, …, Dk}, where k is the number of data nodes; 5) CIB-class data node queue D(CIB) = {D1, D2, …, Dl}, where l is the number of data nodes;
D2-2. The data nodes in queue D(MAX) store only replicas of the data blocks in queue B(h). The data blocks in queues B(m) and B(n) can be stored on the data nodes in queues D(CPU), D(IO) and D(CIM): medium-heat data blocks are stored preferentially on IO-class data nodes (IO being the main performance support required by high-heat data), then on CPU-class data nodes, and finally on CIM-class data nodes; normal data blocks are stored preferentially on CIM-class data nodes, and CPU-class and IO-class data nodes are considered only when those data nodes are saturated. The cold data blocks in queue B(c) can be stored only on the data nodes in queue D(CIB);
D3. The placement strategy takes data correlation into account;
D3-1. Before a replica is placed, it is checked whether correlated data blocks exist. Storing two or more correlated data blocks on one node should be avoided, because when a user accesses one of them the node may then have to serve accesses to several data blocks at once; the replicas of correlated data blocks should therefore be placed on separate nodes;
D3-2. The correlation sets are obtained from step B. It is first queried, by the BID of the data block, whether a corresponding correlation set exists; if not, this step is skipped; if so, the data blocks correlated with this data block are recorded and the current node is marked, and the marked node is not considered when those correlated data blocks are placed;
D4. In view of data locality and network transmission, most replicas of a data block are stored on the same rack; however, to preserve cluster availability, two of the replicas of a data block must be stored on different racks.
The invention provides a data block replica strategy for heterogeneous Hadoop cluster environments. The heat of each data block is measured by calculating its access frequency in each period, and data blocks are then placed on different data nodes according to their heat. Data block correlation is taken into account during placement: correlated data blocks are placed on separate data nodes rather than stored together on one node, which avoids several data blocks on one data node being accessed at the same time and reduces the load on that data node. The placement strategy provided by the invention improves the execution performance of the cluster and the utilization of its resources. Fig. 1 details the overall flow of the present invention, and Fig. 2 details the flow of the data block placement strategy.
The following is a system embodiment corresponding to the above method embodiment, and it can be implemented in cooperation with the above embodiment. The technical details mentioned in the above embodiment remain valid in this embodiment and are not repeated here. Correspondingly, the technical details mentioned in this embodiment also apply to the above embodiment.
The invention also provides a data block placement system based on the heterogeneous Hadoop cluster environment, which comprises the following modules:
module 1, which divides the data blocks stored in the heterogeneous cluster environment into hot data blocks, medium-heat data blocks, normal data blocks and cold data blocks according to the access frequency of the data blocks, and classifies the data nodes in the heterogeneous cluster environment into performance classes according to their performance and a preset performance standard;
module 2, which performs data block correlation analysis and marks the correlated data blocks within each classification of the data blocks;
module 3, which executes a data block placement strategy, in which each data block is placed on a data node of the classification that matches its performance requirement, according to the classifications of the data blocks and of the data nodes;
module 4, which, when the data block placement strategy is executed, judges whether the data node selected for the current data block already stores another data block correlated with the current data block, and if so, invokes module 3 again within the same data node classification to select another data node for placement;
module 5, which completes the placement of the current data block and invokes module 3 again until all the data blocks are placed.
In the data block placement system based on the heterogeneous Hadoop cluster environment, module 1 comprises:
module 11, which obtains, by using a log collection tool, the number M of read operations on each data block in the heterogeneous cluster environment within a specified period T, and obtains the access frequency B_f of the current period according to the balance factor τ and the access frequency B_f(pre) of the previous period:
module 12, which calculates an average access frequency B_f(avg) from the per-period access frequencies of the data block to measure the heat of the data block, and classifies the data blocks from hot data blocks to cold data blocks in descending order of heat.
In the data block placement system based on the heterogeneous Hadoop cluster environment, module 2 comprises:
module 21, which performs correlation analysis using the covariance cov between data blocks, according to the access frequency of each data block in different periods obtained in module 1:
where n is the number of periods, i is the current period, X and Y represent the access frequencies of data blocks B1 and B2, respectively, in the current period, and $\bar{X}$ and $\bar{Y}$ represent the average access frequencies of data blocks B1 and B2, respectively, over the n periods;
module 22, which judges whether the covariance cov is positive; if so, the access frequencies of the two data blocks vary in the same direction and data blocks B1 and B2 have access correlation; otherwise, data blocks B1 and B2 do not have access correlation.