Step 12: calculating an average access frequency B_F(avg) from the access frequency of the data block in each period to measure the heat of the data block, and dividing the data blocks into hot data blocks and cold data blocks in descending order of heat.

In the data block placement method based on the heterogeneous Hadoop cluster environment, step 2 comprises:

Step 21: according to the access frequency of each data block in different periods obtained in step A, performing correlation analysis using the covariance cov between data blocks:

where n is the number of periods, i is the current period, X and Y represent the access frequencies of data blocks B1 and B2, respectively, in the current period, and X̄ and Ȳ represent the average access frequencies of data blocks B1 and B2, respectively, over the n periods;

Step 22: judging whether the covariance cov is positive; if so, the access frequencies of the two data blocks vary in the same direction and data blocks B1 and B2 have access correlation; otherwise, data blocks B1 and B2 do not have access correlation.
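The covariance test of steps 21 and 22 can be sketched as follows. This is a minimal illustration that assumes the covariance is normalized by n (the sign test gives the same answer with n - 1); the function names `covariance` and `has_access_correlation` are hypothetical and do not come from the invention.

```python
from statistics import fmean


def covariance(x, y):
    """Covariance of two equal-length access-frequency series.

    ASSUMPTION: normalized by n. The sign, which is all that step 22
    inspects, does not depend on this choice.
    """
    if len(x) != len(y) or not x:
        raise ValueError("series must be non-empty and of equal length")
    mean_x, mean_y = fmean(x), fmean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)


def has_access_correlation(x, y):
    """Step 22: two blocks are access-correlated when cov is positive."""
    return covariance(x, y) > 0


# Per-period access frequencies of three hypothetical data blocks.
b1 = [1.0, 2.0, 3.0, 4.0]
b2 = [2.0, 4.0, 6.0, 8.0]   # rises together with b1
b3 = [4.0, 3.0, 2.0, 1.0]   # falls while b1 rises
```

With these series, B1 and B2 are reported as correlated, while B1 and B3 are not.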

The invention also provides a data block placement system based on the heterogeneous Hadoop cluster environment, which comprises:

module 1, which divides the data blocks stored in the heterogeneous cluster environment into hot data blocks, medium-hot data blocks, normal data blocks and cold data blocks according to the frequency with which the data blocks are accessed, and classifies the data nodes in the heterogeneous cluster environment by performance according to the performance of each data node and a preset performance standard;

module 2, which performs data block correlation analysis and marks the correlated data blocks within each classification of data blocks;

module 3, which executes a data block placement strategy and, according to the classifications of the data blocks and the data nodes, places each data block on data nodes of the classification that matches its performance requirements;

module 4, which judges, when the data block placement strategy is executed, whether the data node selected for the current data block already holds other data blocks correlated with it, and if so, re-executes module 3 within the classification of that data node to select another data node for placement;

and module 5, which completes the placement of the current data block and executes module 3 again until placement has been completed for all the data nodes.

In the data block placement system based on the heterogeneous Hadoop cluster environment, module 1 comprises:

module 11, which obtains, by using a log collection tool, the number M of read operations on each data block in the heterogeneous cluster environment within a specified period T, and obtains the access frequency B_f of the current period from the balance factor τ and the access frequency B_f(pre) of the previous period:

module 12, which calculates an average access frequency B_F(avg) from the access frequency of the data block in each period to measure the heat of the data block, and divides the data blocks into hot data blocks and cold data blocks in descending order of heat.

In the data block placement system based on the heterogeneous Hadoop cluster environment, module 2 comprises:

module 21, which performs correlation analysis using the covariance cov between data blocks according to the access frequency of each data block in different periods obtained in module A:

where n is the number of periods, i is the current period, X and Y represent the access frequencies of data blocks B1 and B2, respectively, in the current period, and X̄ and Ȳ represent the average access frequencies of data blocks B1 and B2, respectively, over the n periods;

module 22, which judges whether the covariance cov is positive; if so, the access frequencies of the two data blocks vary in the same direction and data blocks B1 and B2 have access correlation; otherwise, data blocks B1 and B2 do not have access correlation.

As can be seen from the above scheme, the advantages of the invention are as follows:

The invention provides a data block replica strategy for a heterogeneous Hadoop cluster environment. The heat of each data block is measured by calculating its access frequency in each period, and the data blocks are then placed on different data nodes according to their heat. Correlation between data blocks is taken into account during placement: correlated data blocks are placed dispersedly and are not stored on the same data node at the same time, which avoids a single data node serving accesses to several data blocks simultaneously and reduces the load on that data node. The placement strategy provided by the invention thereby improves the execution performance of the cluster and the utilization rate of its resources. FIG. 1 shows the overall flow of the invention, and FIG. 2 shows the flow of the data block placement strategy in detail.

Drawings

FIG. 1 is an overall flow chart of a data block placement strategy based on a heterogeneous Hadoop cluster environment;

FIG. 2 is a detailed flow chart of the data block placement strategy.

Detailed Description

The invention aims to provide a data block placement strategy for hot and cold data in an existing Hadoop cluster, based on a heterogeneous Hadoop cluster environment, so as to improve the execution performance and the resource utilization rate of the cluster.

Specifically, the present invention comprises the steps of:

A. Determining how hot or cold each data block is, implemented as follows:

A1. Calculating the access frequency of each data block;

A1-1. The number of read operations on each data block in HDFS within a specified period T is obtained with the Flume log collection tool and recorded as M. Because the access frequency may differ greatly from one period to the next, a balance factor τ is set; with the access frequency of the previous period recorded as B_f(pre), the access frequency B_f of the current period is calculated as follows:

A1-2. The access frequency B_f(i) of the data block in the i-th period can be derived from formula (1) in step A1-1, where B_f(0) represents the access frequency when the data block is created; since a newly created data block has no access history from a previous period, this value is 0. The calculation formula of B_f(i) is as follows:

A2. Calculating the average access frequency B_F(avg) from the per-period access frequencies of the data block obtained in the preceding step;

A3. Using B_F(avg) from step A2 to measure the heat of the data blocks, and dividing the data blocks, in descending order of heat, into hot data blocks, medium-hot data blocks, normal data blocks and cold data blocks.
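Steps A1 to A3 can be sketched as follows. The sketch assumes, purely for illustration, that formula (1) is a weighted blend of the current read rate M/T and B_f(pre) with weight τ; the actual formula may differ. The heat thresholds and all function names are likewise hypothetical.

```python
def access_frequency(reads, period, prev, tau=0.5):
    """B_f for the current period.

    ASSUMPTION: modelled as tau * (M / T) + (1 - tau) * B_f(pre).
    This stands in for formula (1) and is not taken from the source.
    """
    return tau * (reads / period) + (1.0 - tau) * prev


def frequency_history(reads_per_period, period, tau=0.5):
    """B_f(1..n) for one data block, starting from B_f(0) = 0."""
    history = []
    prev = 0.0  # a newly created block has no access history
    for reads in reads_per_period:
        prev = access_frequency(reads, period, prev, tau)
        history.append(prev)
    return history


def average_frequency(history):
    """B_F(avg): mean of the per-period access frequencies."""
    return sum(history) / len(history) if history else 0.0


def classify_heat(avg, thresholds=(10.0, 5.0, 1.0)):
    """Map B_F(avg) to a heat class; the cut-offs are hypothetical."""
    hot, medium, normal = thresholds
    if avg >= hot:
        return "hot"
    if avg >= medium:
        return "medium"
    if avg >= normal:
        return "normal"
    return "cold"


# A block read 20 times, then 40 times, in two periods of length T = 2.
history = frequency_history([20, 40], period=2, tau=0.5)
heat = classify_heat(average_frequency(history))
```

Under these assumptions the block's per-period frequencies are 5.0 and 12.5, its average is 8.75, and it is classified as medium-hot.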

B. The data block correlation analysis is realized by the following method:

B1. Data blocks having access correlation;

B1-1. Correlation here mainly refers to a certain degree of association between data blocks in the cluster, for example data block B1 and data block B2. In the first case, a user who accesses data block B1 also accesses data block B2. In the second case, when the access frequency of data block B1 rises or falls over time, the access frequency of data block B2 changes linearly in the same direction. In either case, the invention regards data blocks B1 and B2 as correlated;

B2. a method of detecting correlation;

B2-1. Correlation analysis is carried out using covariance, according to the access frequency of each data block in different periods obtained in step A. For example, to detect the correlation of data blocks B1 and B2, the following formula can be used:

where n is the number of periods, i is the current period, X and Y represent the access frequencies of data blocks B1 and B2, respectively, and X̄ and Ȳ represent the average access frequencies of data blocks B1 and B2, respectively, over the n periods;

B2-2. If the calculated covariance cov is positive, the access frequencies of the two data blocks vary in the same direction. If cov is 0, the two data blocks are regarded as independent of each other. If cov is negative, the two data blocks are negatively correlated, which is not the focus of the invention; the invention is concerned with positive correlation, that is, a positive value of cov;

B2-3. This detection method tests whether two data blocks are correlated, but more than two data blocks may be correlated with one another. Therefore, if data block B1 is found to be correlated with data block B2, and data block B2 (or B1) is correlated with data block B3, then data blocks B1, B2 and B3 are all regarded as correlated;

B3. Creating and marking the sets of correlated data blocks;

B3-1. The data blocks are classified by heat in step A (hot, medium-hot, normal and cold data blocks). Each category is traversed to detect correlation among its data blocks; if two or more data blocks are correlated, a correlation set C = {B1, B2, …, Bn} is established, where n is the number of data blocks in the set, and each set is marked with the BID of its first data block;
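One way to build the correlation sets of steps B2-3 and B3-1 is to merge pairwise-correlated blocks with a union-find structure, so that chains such as B1~B2 and B2~B3 end up in one set. The sketch below is an assumption about how this could be done, not the invention's stated procedure; it normalizes the covariance by n, and the names `correlation_sets` and `series_by_bid` are hypothetical.

```python
from itertools import combinations


def covariance(x, y):
    """Covariance normalized by n (assumption; only the sign is used)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)


def correlation_sets(series_by_bid):
    """Group the blocks of one heat category into correlation sets.

    series_by_bid maps a block ID (BID) to its per-period access
    frequencies. Returns {mark: [BIDs]}, where the mark is the BID of
    the first block in the set; blocks with no partner are omitted.
    """
    parent = {bid: bid for bid in series_by_bid}

    def find(bid):
        while parent[bid] != bid:
            parent[bid] = parent[parent[bid]]  # path halving
            bid = parent[bid]
        return bid

    for a, b in combinations(series_by_bid, 2):
        if covariance(series_by_bid[a], series_by_bid[b]) > 0:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two sets

    groups = {}
    for bid in series_by_bid:  # dict order keeps the first BID first
        groups.setdefault(find(bid), []).append(bid)
    return {m[0]: m for m in groups.values() if len(m) > 1}


hot_blocks = {
    "B1": [1.0, 2.0, 3.0],
    "B2": [2.0, 3.0, 5.0],
    "B3": [1.0, 3.0, 4.0],
    "B4": [3.0, 2.0, 1.0],  # moves against the others
}
sets = correlation_sets(hot_blocks)
```

Here B1, B2 and B3 form one set marked with "B1", and B4 belongs to no set.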

C. Data node classification, implemented as follows:

C1. Hardware differences are mainly reflected in the CPU, disk I/O, network and memory. Memory differs mainly in size, with little difference in performance, and network transmission is not the focus of the invention, so these two items are not considered; the classification criteria of the invention focus on the CPU and disk I/O;

C2. The data nodes are roughly classified as follows: 1) machines with strong CPU and I/O performance are called the MAX type; 2) machines with strong CPU performance and general I/O performance are called the CPU type; 3) machines with strong I/O performance and general CPU performance are called the IO type; 4) machines with general CPU and I/O performance are called the CIM type; and 5) machines with weak CPU and I/O performance are called the CIB type;
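The classification in step C2 can be illustrated with a small lookup. The sketch assumes that the IO type means strong I/O with general CPU, that CIM means general performance in both dimensions, and that a node that is weak in either dimension falls into CIB; the benchmark scores, cut-offs and names are all hypothetical.

```python
from enum import IntEnum


class Level(IntEnum):
    WEAK = 0
    GENERAL = 1
    STRONG = 2


def grade(score, general_at=40.0, strong_at=70.0):
    """Map a benchmark score to a level; both cut-offs are hypothetical."""
    if score >= strong_at:
        return Level.STRONG
    if score >= general_at:
        return Level.GENERAL
    return Level.WEAK


def classify_node(cpu_score, io_score):
    """Return MAX, CPU, IO, CIM or CIB for one data node."""
    cpu, io = grade(cpu_score), grade(io_score)
    if Level.WEAK in (cpu, io):
        return "CIB"  # ASSUMPTION: any weak dimension makes the node CIB
    if cpu is Level.STRONG and io is Level.STRONG:
        return "MAX"
    if cpu is Level.STRONG:
        return "CPU"  # strong CPU, general I/O
    if io is Level.STRONG:
        return "IO"   # strong I/O, general CPU
    return "CIM"      # general in both dimensions
```

For example, `classify_node(90, 50)` returns `"CPU"` and `classify_node(50, 50)` returns `"CIM"`.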

D. Heat-based data block placement strategy, implemented as follows:

D1. In step A the data blocks are divided into hot data blocks, medium-hot data blocks, normal data blocks and cold data blocks, and four queues are generated from these four classifications;

D1-1. Hot data block queue B(h) = {B1, B2, …, Bm}, where m is the number of data blocks; medium-hot data block queue B(m) = {B1, B2, …, Bj}, where j is the number of data blocks; normal data block queue B(n) = {B1, B2, …, Bk}, where k is the number of data blocks; cold data block queue B(c) = {B1, B2, …, Bn}, where n is the number of data blocks;

D2. Obtaining several groups of data node queues according to the classification in step C2;

D2-1. 1) MAX-class data node queue D(MAX) = {D1, D2, …, Dm}, where m is the number of data nodes; 2) CPU-class data node queue D(CPU) = {D1, D2, …, Dn}, where n is the number of data nodes; 3) IO-class data node queue D(IO) = {D1, D2, …, Dj}, where j is the number of data nodes; 4) CIM-class data node queue D(CIM) = {D1, D2, …, Dk}, where k is the number of data nodes; 5) CIB-class data node queue D(CIB) = {D1, D2, …, Dl}, where l is the number of data nodes;

D2-2. Data nodes in queue D(MAX) store only replicas of data blocks in queue B(h). Data blocks in queues B(m) and B(n) can be stored on data nodes in queues D(CPU), D(IO) and D(CIM): medium-hot data blocks are preferentially stored on IO-class data nodes (I/O is the performance that high-heat data mainly needs), then on CPU-class data nodes, and finally on CIM-class data nodes, while normal data blocks are preferentially stored on CIM-class data nodes, and CPU-class and IO-class nodes are considered only when those data nodes are saturated. Cold data blocks in queue B(c) can only be stored on data nodes in queue D(CIB). The method is illustrated in more detail in FIG. 2;
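The preference order of step D2-2 can be written as a lookup table from heat class to node classes. This is a minimal sketch: the fallback order CPU before IO for normal data blocks is an assumption (the text does not rank them), and `choose_node`, the free-slot counts and the node names are hypothetical.

```python
# Node classes to try, in order, for each heat class (step D2-2).
PREFERENCE = {
    "hot": ["MAX"],
    "medium": ["IO", "CPU", "CIM"],
    "normal": ["CIM", "CPU", "IO"],  # ASSUMPTION: CPU is tried before IO
    "cold": ["CIB"],
}


def choose_node(heat, free_slots_by_class):
    """Pick the first unsaturated node in preference order.

    free_slots_by_class maps a node class to {node name: free slots}.
    Returns (node class, node name), or None if every allowed node
    is saturated.
    """
    for node_class in PREFERENCE[heat]:
        for node, free in free_slots_by_class.get(node_class, {}).items():
            if free > 0:
                return node_class, node
    return None


cluster = {
    "MAX": {"d0": 1},
    "IO": {"d1": 0},   # saturated
    "CPU": {"d2": 1},
    "CIM": {"d3": 2},
    "CIB": {"d9": 1},
}
```

With this cluster, a medium-hot block skips the saturated IO node and lands on `("CPU", "d2")`, while a normal block goes to `("CIM", "d3")`.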

D3. The placement strategy takes data correlation into account;

D3-1. Before a replica is placed, check whether correlated data blocks exist, and avoid storing two or more correlated data blocks on one node; otherwise, when a user accesses one of them, the node may have to serve accesses to several data blocks. Correlated data blocks should therefore be placed dispersedly;

D3-2. Obtain the correlation sets from step B, and first query by the BID of the data block whether a corresponding correlation set exists. If not, skip this step; if so, record the data blocks correlated with this data block and mark the current node, so that the current node is not considered when those data blocks are distributed;
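The correlation check of steps D3-1 and D3-2 can be sketched as a filter over candidate nodes. The data structures below (a mapping from set mark to member BIDs, and a mapping from node to the blocks it holds) and the name `place_block` are assumptions made for illustration.

```python
def place_block(bid, candidates, blocks_on_node, correlation_sets):
    """Place bid on the first candidate that holds no correlated block.

    candidates       -- node names in preference order
    blocks_on_node   -- {node name: set of BIDs stored there}, updated
    correlation_sets -- {mark BID: [member BIDs]}
    Returns the chosen node, or None if every candidate conflicts.
    """
    related = set()
    for members in correlation_sets.values():
        if bid in members:
            related.update(members)
    related.discard(bid)  # a block does not conflict with itself

    for node in candidates:
        if related.isdisjoint(blocks_on_node.get(node, set())):
            blocks_on_node.setdefault(node, set()).add(bid)
            return node
    return None


sets = {"B1": ["B1", "B2", "B3"]}
stored = {"d1": {"B1"}, "d2": set()}
first = place_block("B2", ["d1", "d2"], stored, sets)   # d1 holds B1
second = place_block("B3", ["d1", "d2"], stored, sets)  # both conflict
third = place_block("B9", ["d1", "d2"], stored, sets)   # uncorrelated
```

In this run B2 is steered to d2, B3 finds no conflict-free node among the two candidates, and the uncorrelated block B9 is placed on d1.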

D4. Considering data locality and network transmission, most replicas of a data block are stored on the same rack, but, following the principle of cluster availability, two of the replicas of the data block must be stored on different racks.

In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.

The method comprises four parts: A. determining how hot or cold each data block is; B. analyzing the correlation of the data blocks; C. classifying the data nodes; D. a heat-based data block placement strategy. One specific implementation is as follows:

Specifically, the present invention comprises the steps of:

A1. Calculating the access frequency of each data block;

A1-1. The number of read operations on each data block in HDFS within a specified period T is obtained with the Flume log collection tool and recorded as M. Because the access frequency may differ greatly from one period to the next, a balance factor τ is set; with the access frequency of the previous period denoted B_f(pre), the access frequency B_f of the current period is calculated by the following formula:

A1-2. The access frequency B_f(i) of the data block in the i-th period can be derived from formula (1) in step A1-1, where B_f(0) represents the access frequency when the data block is created; since a newly created data block has no access history from a previous period, this value is 0. The calculation formula of B_f(i) is as follows:

B. The data block correlation analysis is realized by the following method:

B1. Data blocks having correlation;

B1-1. Correlation here refers to a certain degree of association between data blocks in the cluster, for example data block B1 and data block B2. In the first case, a user who accesses data block B1 also accesses data block B2. In the second case, when the access frequency of data block B1 rises or falls over time, the access frequency of data block B2 changes linearly in the same direction. In either case, data blocks B1 and B2 are regarded as correlated;

B2. a method of detecting correlation;


B3. Creating and marking the sets of correlated data blocks;

C. Data node classification, implemented as follows:

D2-2. Data nodes in queue D(MAX) store only replicas of data blocks in queue B(h). Data blocks in queues B(m) and B(n) can be stored on data nodes in queues D(CPU), D(IO) and D(CIM): medium-hot data blocks are preferentially stored on IO-class data nodes (I/O is the performance that high-heat data mainly needs), then on CPU-class data nodes, and finally on CIM-class data nodes, while normal data blocks are preferentially stored on CIM-class data nodes, and CPU-class and IO-class nodes are considered only when those data nodes are saturated. Cold data blocks in queue B(c) can only be stored on data nodes in queue D(CIB);

D3. The placement strategy takes data correlation into account;

D3-1. Before a replica is placed, check whether correlated data blocks exist, and avoid storing two or more correlated data blocks on one node; otherwise, when a user accesses one of them, the node may have to serve accesses to several data blocks. Correlated data blocks should therefore be replicated dispersedly;

The invention provides a data block replica strategy for a heterogeneous Hadoop cluster environment. The heat of each data block is measured by calculating its access frequency in each period, and the data blocks are then placed on different data nodes according to their heat. Correlation between data blocks is taken into account during placement: correlated data blocks are placed dispersedly and are not stored on the same data node at the same time, which avoids a single data node serving accesses to several data blocks simultaneously and reduces the load on that data node. The placement strategy provided by the invention improves the execution performance of the cluster and the utilization rate of its resources. FIG. 1 shows the overall flow of the invention, and FIG. 2 shows the flow of the data block placement strategy in detail.

The following is a system embodiment corresponding to the above method embodiment, and the two can be implemented in cooperation with each other. The technical details mentioned in the above embodiment remain valid in this embodiment and are not repeated here. Accordingly, the technical details mentioned in this embodiment can also be applied to the above embodiment.


Claims


1. A data block placement method based on a heterogeneous Hadoop cluster environment, characterized by comprising:

Step 1: according to the frequency with which data blocks are accessed, dividing the data blocks stored in the heterogeneous cluster environment into hot data blocks, medium-hot data blocks, normal data blocks and cold data blocks, and, according to the performance of each data node in the heterogeneous cluster environment and a preset performance standard, classifying the data nodes in the heterogeneous cluster environment by performance;

Step 3: executing a data block placement strategy, and, according to the classifications of the data blocks and the data nodes, placing each data block on data nodes of the classification that matches its performance requirements;

Step 4: when executing the data block placement strategy, judging whether the data node selected for the current data block already holds other data blocks correlated with it, and if so, re-executing step 3 within the classification of that data node to select another data node for placement;

Step 5: completing the placement of the current data block, and executing step 3 again until placement has been completed for all the data nodes.

2. The data block placement method based on a heterogeneous Hadoop cluster environment as claimed in claim 1, characterized in that step 1 comprises:

Step 11: obtaining, through a log collection tool, the number M of read operations on each data block in the heterogeneous cluster environment within a specified period T, and obtaining the access frequency B_f of the current period from the balance factor τ and the access frequency B_f(pre) of the previous period:

Step 12: calculating an average access frequency B_F(avg) from the access frequency of the data block in each period to measure the heat of the data block, and dividing the data blocks into hot data blocks and cold data blocks in descending order of heat.

3. The data block placement method based on a heterogeneous Hadoop cluster environment as claimed in claim 2, characterized in that step 2 comprises:

Step 21: according to the access frequency of each data block in different periods obtained in step A, performing correlation analysis using the covariance cov between data blocks:

where n is the number of periods, i is the current period, X and Y represent the access frequencies of data blocks B1 and B2, respectively, in the current period, and X̄ and Ȳ represent the average access frequencies of data blocks B1 and B2, respectively, over the n periods;

Step 22: judging whether the covariance cov is positive; if so, the access frequencies of the two data blocks vary in the same direction and data blocks B1 and B2 have access correlation; otherwise, data blocks B1 and B2 do not have access correlation.

4. A data block placement system based on a heterogeneous Hadoop cluster environment, characterized by comprising:

Module 1, which, according to the frequency with which data blocks are accessed, divides the data blocks stored in the heterogeneous cluster environment into hot data blocks, medium-hot data blocks, normal data blocks and cold data blocks, and, according to the performance of each data node in the heterogeneous cluster environment and a preset performance standard, classifies the data nodes in the heterogeneous cluster environment by performance;

Module 2, which performs data block correlation analysis and marks the correlated data blocks within each classification of data blocks;

Module 3, which executes a data block placement strategy and, according to the classifications of the data blocks and the data nodes, places each data block on data nodes of the classification that matches its performance requirements;

Module 4, which, when the data block placement strategy is executed, judges whether the data node selected for the current data block already holds other data blocks correlated with it, and if so, re-executes module 3 within the classification of that data node to select another data node for placement;

Module 5, which completes the placement of the current data block and executes module 3 again until placement has been completed for all the data nodes.

5. The data block placement system based on a heterogeneous Hadoop cluster environment as claimed in claim 1, characterized in that module 1 comprises:

Module 11, which obtains, through a log collection tool, the number M of read operations on each data block in the heterogeneous cluster environment within a specified period T, and obtains the access frequency B_f of the current period from the balance factor τ and the access frequency B_f(pre) of the previous period:

Module 12, which calculates an average access frequency B_F(avg) from the access frequency of the data block in each period to measure the heat of the data block, and divides the data blocks into hot data blocks and cold data blocks in descending order of heat.

6. The data block placement system based on a heterogeneous Hadoop cluster environment as claimed in claim 2, characterized in that module 2 comprises:

Module 21, which performs correlation analysis using the covariance cov between data blocks according to the access frequency of each data block in different periods obtained in module A:

where n is the number of periods, i is the current period, X and Y represent the access frequencies of data blocks B1 and B2, respectively, in the current period, and X̄ and Ȳ represent the average access frequencies of data blocks B1 and B2, respectively, over the n periods;

Module 22, which judges whether the covariance cov is positive; if so, the access frequencies of the two data blocks vary in the same direction and data blocks B1 and B2 have access correlation; otherwise, data blocks B1 and B2 do not have access correlation.