CN113407620A - Data block placement method and system based on heterogeneous Hadoop cluster environment - Google Patents

Data block placement method and system based on heterogeneous Hadoop cluster environment

Info

Publication number
CN113407620A
Authority
CN
China
Prior art keywords
data
data blocks
data block
access frequency
blocks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010185518.2A
Other languages
Chinese (zh)
Other versions
CN113407620B (en)
Inventor
宋�莹
许家豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Best Innovation Beijing Technology Co ltd
Original Assignee
Beijing Information Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Science and Technology University
Priority to CN202010185518.2A
Publication of CN113407620A
Application granted
Publication of CN113407620B
Legal status: Active
Anticipated expiration

Abstract

The invention provides a data block placement method and a data block placement system based on a heterogeneous Hadoop cluster environment. The placing strategy provided by the invention improves the execution performance of the cluster and the utilization rate of resources.

Description

Data block placement method and system based on heterogeneous Hadoop cluster environment
Technical Field
The invention relates to replica placement that improves cluster performance according to how hot or cold each data block is in a Hadoop cluster, and belongs to the field of distributed computing.
Background
With the continued development of internet technology, we have entered the era of big data, and big-data technologies are being applied ever more widely and deeply. Hadoop is currently the most popular open-source big-data framework: a platform that can process massive data offline and in parallel. It is highly reliable, highly scalable, efficient, low-cost, and open source, and has become the preferred massive-data processing solution for many internet companies. Hadoop mainly comprises the Hadoop Distributed File System (HDFS) and the MapReduce distributed computing framework. Although Hadoop is by now fairly mature, it still has shortcomings in some respects that call for improvement and optimization.
HDFS stores many files, both large and small. A large file is made up of several data blocks, whereas a small file occupies only part of one data block. The heat of a data block is measured by how frequently users access it: the higher the access frequency, the hotter the data block. There are therefore hot-spot data (accessed frequently) and cold data (accessed rarely). Hot-spot data, being accessed often, raise two problems: 1) because their access frequency is high, they may be accessed by many users at the same time, which increases the load on the node; 2) because users access them frequently, their response time must meet user-experience requirements. Conventional Hadoop faces both problems.
The traditional Hadoop system is designed for a homogeneous computing environment made up of identically configured machines, in which every node has the same storage performance and disk capacity. When data are written to HDFS, they are divided into blocks of equal size, and Hadoop then spreads the blocks evenly across the nodes by random distribution. Today, however, the clusters that run Hadoop are often heterogeneous, and the data stored in Hadoop differ in heat. Hot data are accessed often and by many users, so the nodes that store them need high storage performance; cold data are rarely or never accessed and only need to be stored. Hadoop's traditional homogeneous-cluster design is therefore neither efficient nor practical where data heat is concerned.
The default Hadoop replica strategy has shortcomings with respect to user requirements, storage performance, system resources, and so on. In a heterogeneous cluster environment it leads to problems such as low system resource utilization, unbalanced node load, low fault tolerance, and heavy network transmission and communication load, and may even cause failures.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a data block placement method based on a heterogeneous Hadoop cluster environment, which comprises the following steps:
step 1, dividing the data blocks stored in the heterogeneous cluster environment into hot data blocks, medium-heat data blocks, normal data blocks and cold data blocks according to how frequently each data block is accessed, and classifying the data nodes in the heterogeneous cluster environment by performance according to the performance of each data node and a preset performance standard;
step 2, carrying out data block correlation analysis, and marking the data blocks with correlation in each classification of the data blocks;
step 3, executing a data block placement strategy, and placing each data block on data nodes of different classifications according to the different performance requirements according to the classifications of the data blocks and the data nodes;
step 4, when executing the data block placement strategy, judging whether the data node selected for the current data block already holds other data blocks correlated with it, and if so, executing step 3 again within the same data node classification to select another data node for placement;
step 5, finishing the placement of the current data block, and executing step 3 again until all the data blocks are placed.
In the data block placement method based on the heterogeneous Hadoop cluster environment, step 1 comprises:
step 11, obtaining the number M of read operations of each data block in the heterogeneous cluster environment within a specified period T by using a log collection tool, and obtaining the access frequency B_f of the current period according to the balance factor τ and the access frequency B_f(pre) of the previous period:
Figure BDA0002414041590000021
step 12, calculating an average access frequency B_f(avg) from the access frequency of each period of the data block to measure the heat of the data block, and dividing the data blocks, in order of heat from high to low, into hot data blocks through cold data blocks.
In the data block placement method based on the heterogeneous Hadoop cluster environment, step 2 comprises:
step 21, according to the access frequency of each data block in different periods obtained in step 1, using the covariance cov between data blocks to perform correlation analysis:
Figure BDA0002414041590000031
where n is the number of cycles, i is the current cycle, X and Y represent the access frequency of data blocks B1 and B2, respectively, in the current cycle, and $\bar{X}$ and $\bar{Y}$ represent the average access frequency of the data blocks B1 and B2, respectively, over the n cycles;
and step 22, judging whether the covariance cov is positive, if so, indicating that the variation trends of the access frequencies of the two data blocks are consistent, and the data blocks B1 and B2 have access correlation, otherwise, indicating that the data blocks B1 and B2 do not have access correlation.
The invention also provides a data block placement system based on the heterogeneous Hadoop cluster environment, which comprises:
the module 1 divides the data blocks stored in the heterogeneous cluster environment into hot data blocks, medium-hot data blocks, normal data blocks and cold data blocks according to the access frequency of the data blocks, and classifies the data nodes in the heterogeneous cluster environment according to different performances according to the performances of the data nodes in the heterogeneous cluster environment and a preset performance standard;
the module 2 is used for analyzing the relevance of the data blocks and marking the data blocks with relevance in each classification of the data blocks;
the module 3 executes a data block placement strategy, and places each data block on data nodes of different classifications according to different performance requirements according to the classifications of the data block and the data node;
the module 4 judges whether other data blocks which are relevant to the data blocks exist in the data nodes selected and placed by the current data blocks when executing the data block placing strategy, if so, the module 3 is re-executed in the classification of the data nodes, and other data nodes are selected and placed;
the module 5 completes the placement of the current data block, and executes the module 3 again until all the data blocks are placed.
In the data block placement system based on the heterogeneous Hadoop cluster environment, module 1 comprises:
the module 11 obtains, by using a log collection tool, the number M of read operations of each data block in the heterogeneous cluster environment within a specified period T, and obtains, according to the balance factor τ and the access frequency B_f(pre) of the previous period, the access frequency B_f of the current period:
Figure BDA0002414041590000041
the module 12 calculates an average access frequency B_f(avg) from the access frequency of each period of the data block to measure the heat of the data block, and divides the data blocks, in order of heat from high to low, into hot data blocks through cold data blocks.
In the data block placement system based on the heterogeneous Hadoop cluster environment, module 2 comprises:
the module 21 performs correlation analysis by using the covariance cov between the data blocks according to the access frequency of each data block in different periods obtained in the module 1:
Figure BDA0002414041590000042
where n is the number of cycles, i is the current cycle, X and Y represent the access frequency of data blocks B1 and B2, respectively, in the current cycle, and $\bar{X}$ and $\bar{Y}$ represent the average access frequency of the data blocks B1 and B2, respectively, over the n cycles;
the module 22 determines whether the covariance cov is positive, if so, it indicates that the variation trends of the access frequencies of the two data blocks are consistent, and the data blocks B1 and B2 have access correlation, otherwise, it indicates that the data blocks B1 and B2 do not have access correlation.
According to the scheme, the invention has the advantages that:
the invention has the advantages that the data block copy strategy based on the heterogeneous Hadoop cluster environment is provided, the cold and hot degree of the data block is measured by calculating the access frequency of each period of the data block, then the data block is placed on different data nodes according to the difference of the heat degree of the data block, the problem of the relevance of the data block is considered in the placement process, the data blocks with the relevance are placed in a scattered mode and are not stored on the same data node at the same time, the situation that a plurality of data blocks are accessed at the same time on one data node is avoided, and the load of the data node is reduced. By the placement strategy provided by the invention, the execution performance of the cluster and the utilization rate of resources are improved. Fig. 1 details the overall flow of the present invention, and fig. 2 details the flow of the data block placement strategy.
Drawings
FIG. 1 is an overall flow chart of a data block placement strategy based on a heterogeneous Hadoop cluster environment;
fig. 2 is a detailed flow chart of a data block placement strategy.
Detailed Description
The invention aims to provide a data block placement strategy based on a heterogeneous Hadoop cluster environment aiming at hot and cold data in the existing Hadoop cluster, and improve the execution performance and the resource utilization rate of the cluster.
Specifically, the present invention comprises the steps of:
A. Judging how hot or cold each data block is. This is implemented as follows:
A1. Calculating the access frequency of each data block;
A1-1. The number of read operations on each data block in HDFS within a specified period T is obtained with the Flume log collection tool and recorded as M. Because the access frequency can vary widely from one period to the next, a balance factor τ is set. With the access frequency of the previous period denoted B_f(pre), the access frequency B_f of the current period is calculated as follows:
Figure BDA0002414041590000051
A1-2. From formula (1) in step A1-1, the access frequency B_f(i) of the data block in the i-th period can be derived. B_f(0) denotes the access frequency when the data block is created; because a newly created block has no access history from a previous period, its value is 0. B_f(i) is calculated as follows:
Figure BDA0002414041590000052
A2. Calculating the average access frequency B_f(avg) from the per-period access frequencies of the data block obtained in step A1;
A3. Using B_f(avg) from step A2 to measure the heat of each data block, and dividing the data blocks, in order of heat from high to low, into hot-spot data blocks, medium-heat data blocks, normal data blocks and cold data blocks.
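Step A can be sketched in a few lines of Python. The formula images for B_f are not reproduced in this text, so the smoothing form below (a τ-weighted blend of the current read rate M/T and B_f(pre)) is an assumption, not the patent's formula. The function names and the numeric thresholds are likewise hypothetical; the source gives no cut-off values.

```python
from statistics import mean


def smoothed_frequency(reads, period, prev_freq, tau):
    """Blend this period's read rate M/T with the previous frequency B_f(pre).

    ASSUMPTION: exponential-smoothing form; the patent's formula (1) is an
    image that is not available in this text.
    """
    return tau * (reads / period) + (1.0 - tau) * prev_freq


def frequency_history(read_counts, period, tau):
    """Return B_f(1), ..., B_f(n) for one block, starting from B_f(0) = 0."""
    history, prev = [], 0.0
    for reads in read_counts:
        prev = smoothed_frequency(reads, period, prev, tau)
        history.append(prev)
    return history


# Hypothetical lower bounds on B_f(avg); checked from hottest to coldest.
THRESHOLDS = [(50.0, "hot"), (20.0, "medium"), (5.0, "normal")]


def classify_heat(avg_freq):
    """Map an average access frequency to one of the four heat classes."""
    for lower_bound, label in THRESHOLDS:
        if avg_freq >= lower_bound:
            return label
    return "cold"


def block_heat(read_counts, period=1.0, tau=0.5):
    """Heat class of a block from its per-period read counts."""
    return classify_heat(mean(frequency_history(read_counts, period, tau)))
```

In practice the thresholds would have to be tuned to the workload, for example from percentiles of B_f(avg) across all blocks.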
B. The data block correlation analysis is realized by the following method:
B1. Data blocks having an access correlation;
B1-1. Correlation here mainly refers to a degree of association between data blocks in the cluster, for example data block B1 and data block B2. In the first case, a user who accesses data block B1 also accesses data block B2. In the second case, when the access frequency of data block B1 rises or falls from period to period, that of data block B2 changes linearly in the same direction. In either case, the invention regards data blocks B1 and B2 as correlated;
B2. A method of detecting correlation;
B2-1. Using the per-period access frequencies of each data block obtained in step A, correlation analysis is carried out with the covariance. For example, the correlation of data blocks B1 and B2 can be calculated with the following formula:
Figure BDA0002414041590000061
where n is the number of cycles, i is the current cycle, X and Y represent the access frequency of data blocks B1 and B2, respectively, and $\bar{X}$ and $\bar{Y}$ represent the average access frequency of the data blocks B1 and B2, respectively, over the n cycles;
B2-2. If the calculated covariance cov is positive, the access frequencies of the two data blocks trend in the same direction. If cov is 0, the two data blocks show no linear correlation. If cov is negative, the two are negatively correlated, which is not a focus of the invention; the invention is mainly concerned with positive correlation, that is, a positive value of cov;
B2-3. This method detects whether two data blocks are correlated, but more than two data blocks may be correlated with one another. Therefore, if during detection data block B1 is correlated with data block B2, and data block B2 (or B1) is correlated with data block B3, then data blocks B1, B2 and B3 are all treated as correlated;
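The covariance test in B2 can be sketched as follows. The formula image is not reproduced in this text, so the population form (dividing by n) is an assumption; dividing by n − 1 instead would change the magnitude but not the sign, and only the sign is used here. The function names are illustrative.

```python
def covariance(xs, ys):
    """Population covariance of two equally long access-frequency series.

    ASSUMPTION: divides by n; the patent's formula image is unavailable.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("series must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def access_correlated(xs, ys):
    """True when the two blocks' access frequencies trend together (cov > 0)."""
    return covariance(xs, ys) > 0
```

For example, the series [1, 2, 3] and [2, 4, 6] rise together and give a positive covariance, while [1, 2, 3] and [6, 4, 2] move in opposite directions and give a negative one.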
B3. Creating and marking the sets of correlated data blocks;
B3-1. The data blocks have been classified by heat in step A (hot-spot, medium-heat, normal and cold data blocks). Each category is traversed to detect correlated data blocks. If two or more data blocks are correlated, a correlation set C = {B1, B2, …, Bn} is created, where n is the number of data blocks in the set, and each set is marked with the BID of its first data block;
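One way to build the correlation sets, including the transitive case from B2-3, is a union-find pass over the correlated pairs. This is a sketch of one possible realization, not the patent's stated implementation; the function name and the input shape (a list of block IDs plus a list of correlated pairs) are assumptions. It would be called once per heat category.

```python
def correlation_sets(block_ids, correlated_pairs):
    """Group blocks so that correlation is transitive within each group.

    Returns a dict keyed by the BID of the first block in each set (in the
    order given by block_ids); blocks with no correlated partner are omitted.
    """
    parent = {bid: bid for bid in block_ids}

    def find(bid):
        # Follow parent links to the representative, compressing the path.
        while parent[bid] != bid:
            parent[bid] = parent[parent[bid]]
            bid = parent[bid]
        return bid

    for a, b in correlated_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for bid in block_ids:  # input order decides which block is "first"
        groups.setdefault(find(bid), []).append(bid)

    return {members[0]: members for members in groups.values() if len(members) > 1}
```

With blocks B1 to B4 and the pairs (B1, B2) and (B2, B3), the result is a single set marked "B1" that contains B1, B2 and B3; B4 belongs to no set.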
C. The data node classification is implemented as follows:
C1. Hardware differences show up mainly in the CPU, disk I/O, network and memory. Memory differences are mainly a matter of capacity and have little effect on performance, and network transmission is not a focus of the invention, so these two are not considered; the classification standard of the invention focuses on the CPU and disk I/O;
C2. The classification is roughly as follows: 1) machines with strong CPU and IO performance are called MAX types; 2) machines with strong CPU performance and average IO performance are called CPU types; 3) machines with strong IO performance and average CPU performance are called IO types; 4) machines with average CPU and IO performance are called CIM types; and 5) machines with weak CPU and IO performance are called CIB types;
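The five node classes can be illustrated with a small classifier over two normalized benchmark scores. The scores, the thresholds, and the handling of mixed cases are all assumptions; the source only names a "preset performance standard". The sketch reads CIM as average on both dimensions and CIB as weak on both, which is an interpretation of the class names.

```python
def classify_node(cpu_score, io_score, strong=0.75, weak=0.35):
    """Assign a data node to MAX, CPU, IO, CIM or CIB.

    cpu_score and io_score are hypothetical benchmark results scaled to 0..1.
    The strong/weak cut-offs are illustrative, not taken from the source.
    """
    cpu_strong = cpu_score >= strong
    io_strong = io_score >= strong
    if cpu_strong and io_strong:
        return "MAX"
    if cpu_strong:
        return "CPU"
    if io_strong:
        return "IO"
    if cpu_score < weak and io_score < weak:
        return "CIB"
    return "CIM"  # neither strong nor uniformly weak
```

A node that is strong on one dimension and weak on the other falls into the CPU or IO class here; the source does not say how such nodes should be treated.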
D. The heat-based data block placement strategy is implemented as follows:
D1. In step A the data blocks are divided into hot-spot data blocks, medium-heat data blocks, normal data blocks and cold data blocks, and four queues are generated from these four classes;
D1-1. Hot-spot data block queue B(h) = {B1, B2, …, Bm}, where m is the number of data blocks; medium-heat data block queue B(m) = {B1, B2, …, Bj}, where j is the number of data blocks; normal data block queue B(n) = {B1, B2, …, Bk}, where k is the number of data blocks; cold data block queue B(c) = {B1, B2, …, Bn}, where n is the number of data blocks;
D2. Obtaining the data node queues according to the classification in step C2;
D2-1. 1) MAX-class data node queue D(MAX) = {D1, D2, …, Dm}, where m is the number of data nodes; 2) CPU-class data node queue D(CPU) = {D1, D2, …, Dn}, where n is the number of data nodes; 3) IO-class data node queue D(IO) = {D1, D2, …, Dj}, where j is the number of data nodes; 4) CIM-class data node queue D(CIM) = {D1, D2, …, Dk}, where k is the number of data nodes; 5) CIB-class data node queue D(CIB) = {D1, D2, …, Dl}, where l is the number of data nodes;
D2-2. Data nodes in queue D(MAX) store only copies of data blocks in queue B(h). Data blocks in queues B(m) and B(n) can be stored on data nodes in queues D(CPU), D(IO) and D(CIM). Medium-heat data blocks are stored preferentially on IO-class data nodes (I/O performance is what hotter data mainly needs), then on CPU-class data nodes, and finally on CIM-class data nodes. Normal data blocks are stored preferentially on CIM-class data nodes; CPU-class and IO-class nodes are considered only when the CIM-class nodes are full. Data blocks in the cold queue B(c) can be stored only on data nodes in queue D(CIB). Fig. 2 illustrates the method in more detail;
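The preference order in D2-2 maps naturally onto a lookup table. In the sketch below, the names and the free-space check are illustrative, and two points are assumptions rather than statements from the source: that hot blocks go only to MAX nodes (the source says MAX nodes hold only hot blocks), and that normal blocks fall back to CPU before IO once CIM nodes are full (the source does not order the two).

```python
# Node classes to try, in order, for each heat class.
NODE_PREFERENCE = {
    "hot": ["MAX"],
    "medium": ["IO", "CPU", "CIM"],
    "normal": ["CIM", "CPU", "IO"],  # CPU-before-IO fallback is an assumption
    "cold": ["CIB"],
}


def pick_node(heat, nodes_by_class, free_space, block_size):
    """Return the first node, in preference order, with room for the block.

    nodes_by_class: dict mapping a class name to a list of node IDs.
    free_space: dict mapping a node ID to its free bytes.
    Returns None when no eligible node has enough space.
    """
    for node_class in NODE_PREFERENCE[heat]:
        for node in nodes_by_class.get(node_class, []):
            if free_space.get(node, 0) >= block_size:
                return node
    return None
```

Treating a full node class as "skip to the next class" is one reading of the saturation rule for normal blocks; the same fall-through is applied to medium-heat blocks here for simplicity.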
D3. The placement strategy takes data correlation into account;
D3-1. Before a copy is placed, check whether correlated data blocks exist, and avoid storing two or more correlated data blocks on one node. Otherwise, when users access one of these blocks, the node may have to serve accesses to several blocks at once; correlated blocks should therefore be placed on different nodes;
D3-2. Obtain the correlation sets from step B. First query, by the BID of the data block, whether a corresponding correlation set exists. If not, skip this step. If so, record the data blocks correlated with this block and mark the node that holds the current block, so that this node is not considered when the correlated blocks are placed;
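The correlation check in D3 amounts to filtering candidate nodes against what they already hold. The sketch below assumes the correlation sets are available as a dict of BID lists and that current placements are tracked as a dict from node ID to a set of BIDs; both data shapes and the function name are hypothetical.

```python
def pick_uncorrelated_node(block_id, candidates, placed_on, correlation_sets):
    """Return the first candidate node holding no block correlated with block_id.

    candidates: node IDs in preference order.
    placed_on: dict mapping a node ID to the set of BIDs already stored there.
    correlation_sets: dict mapping a set marker to the list of BIDs in that set.
    Returns None when every candidate already holds a correlated block.
    """
    related = set()
    for members in correlation_sets.values():
        if block_id in members:
            related.update(members)
    related.discard(block_id)

    for node in candidates:
        if not (placed_on.get(node, set()) & related):
            return node
    return None
```

A block that belongs to no correlation set has an empty related set, so the first candidate is returned unchanged, which matches the "skip this step" branch of D3-2.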
D4. To account for data locality and network transmission, most copies of a data block are stored on the same rack; to preserve cluster availability, however, two of the copies must be stored on different racks.
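The rack rule in D4 can be sketched as: take the preferred node, add one node from a different rack, then fill the remaining copies from the preferred node's rack first. The function name, the rack map, and the default of three copies are assumptions for illustration; the source does not state a replica count.

```python
def choose_replica_nodes(candidates, rack_of, replicas=3):
    """Pick nodes for the copies of one block so that they span two racks.

    candidates: node IDs in preference order (the first is the primary).
    rack_of: dict mapping a node ID to its rack ID.
    Raises ValueError when the rule cannot be satisfied.
    """
    if replicas < 2 or len(candidates) < replicas:
        raise ValueError("need at least two replicas and enough candidate nodes")

    primary = candidates[0]
    home_rack = rack_of[primary]
    same_rack = [n for n in candidates[1:] if rack_of[n] == home_rack]
    other_rack = [n for n in candidates[1:] if rack_of[n] != home_rack]
    if not other_rack:
        raise ValueError("all candidates are on one rack")

    chosen = [primary, other_rack[0]]  # guarantees two racks are used
    for node in same_rack + other_rack[1:]:  # prefer the home rack for the rest
        if len(chosen) == replicas:
            break
        chosen.append(node)
    return chosen
```

Filling from the home rack first keeps most copies close together for locality, while the single off-rack copy covers the loss of a whole rack.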
In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
A. Judging how hot or cold each data block is; B. analyzing the correlation of the data blocks; C. classifying the data nodes; D. the heat-based data block placement strategy. One specific implementation is as follows:
Specifically, the present invention comprises the steps of:
A. Judging how hot or cold each data block is. This is implemented as follows:
A1. Calculating the access frequency of each data block;
A1-1. The number of read operations on each data block in HDFS within a specified period T is obtained with the Flume log collection tool and recorded as M. Because the access frequency can vary widely from one period to the next, a balance factor τ is set. With the access frequency of the previous period denoted B_f(pre), the access frequency B_f of the current period is calculated as follows:
Figure BDA0002414041590000081
A1-2. From formula (1) in step A1-1, the access frequency B_f(i) of the data block in the i-th period can be derived. B_f(0) denotes the access frequency when the data block is created; because a newly created block has no access history from a previous period, its value is 0. B_f(i) is calculated as follows:
Figure BDA0002414041590000082
A2. Calculating the average access frequency B_f(avg) from the per-period access frequencies of the data block obtained in step A1;
A3. Using B_f(avg) from step A2 to measure the heat of each data block, and dividing the data blocks, in order of heat from high to low, into hot-spot data blocks, medium-heat data blocks, normal data blocks and cold data blocks.
B. The data block correlation analysis is realized by the following method:
B1. Data blocks having a correlation;
B1-1. Correlation here refers to a degree of association between data blocks in the cluster, for example data block B1 and data block B2. In the first case, a user who accesses data block B1 also accesses data block B2. In the second case, when the access frequency of data block B1 rises or falls from period to period, that of data block B2 changes linearly in the same direction. In either case, data blocks B1 and B2 are regarded as correlated;
B2. A method of detecting correlation;
B2-1. Using the per-period access frequencies of each data block obtained in step A, correlation analysis is carried out with the covariance. For example, the correlation of data blocks B1 and B2 can be calculated with the following formula:
Figure BDA0002414041590000091
where n is the number of cycles, i is the current cycle, X and Y represent the access frequency of data blocks B1 and B2, respectively, and $\bar{X}$ and $\bar{Y}$ represent the average access frequency of the data blocks B1 and B2, respectively, over the n cycles;
B2-2. If the calculated covariance cov is positive, the access frequencies of the two data blocks trend in the same direction. If cov is 0, the two data blocks show no linear correlation. If cov is negative, the two are negatively correlated, which is not a focus of the invention; the invention is mainly concerned with positive correlation, that is, a positive value of cov;
B2-3. This method detects whether two data blocks are correlated, but more than two data blocks may be correlated with one another. Therefore, if during detection data block B1 is correlated with data block B2, and data block B2 (or B1) is correlated with data block B3, then data blocks B1, B2 and B3 are all treated as correlated;
B3. Creating and marking the sets of correlated data blocks;
B3-1. The data blocks have been classified by heat in step A (hot-spot, medium-heat, normal and cold data blocks). Each category is traversed to detect correlated data blocks. If two or more data blocks are correlated, a correlation set C = {B1, B2, …, Bn} is created, where n is the number of data blocks in the set, and each set is marked with the BID of its first data block;
C. The data node classification is implemented as follows:
C1. Hardware differences show up mainly in the CPU, disk I/O, network and memory. Memory differences are mainly a matter of capacity and have little effect on performance, and network transmission is not a focus of the invention, so these two are not considered; the classification standard of the invention focuses on the CPU and disk I/O;
C2. The classification is roughly as follows: 1) machines with strong CPU and IO performance are called MAX types; 2) machines with strong CPU performance and average IO performance are called CPU types; 3) machines with strong IO performance and average CPU performance are called IO types; 4) machines with average CPU and IO performance are called CIM types; and 5) machines with weak CPU and IO performance are called CIB types;
D. The heat-based data block placement strategy is implemented as follows:
D1. In step A the data blocks are divided into hot-spot data blocks, medium-heat data blocks, normal data blocks and cold data blocks, and four queues are generated from these four classes;
D1-1. Hot-spot data block queue B(h) = {B1, B2, …, Bm}, where m is the number of data blocks; medium-heat data block queue B(m) = {B1, B2, …, Bj}, where j is the number of data blocks; normal data block queue B(n) = {B1, B2, …, Bk}, where k is the number of data blocks; cold data block queue B(c) = {B1, B2, …, Bn}, where n is the number of data blocks;
D2. Obtaining the data node queues according to the classification in step C2;
D2-1. 1) MAX-class data node queue D(MAX) = {D1, D2, …, Dm}, where m is the number of data nodes; 2) CPU-class data node queue D(CPU) = {D1, D2, …, Dn}, where n is the number of data nodes; 3) IO-class data node queue D(IO) = {D1, D2, …, Dj}, where j is the number of data nodes; 4) CIM-class data node queue D(CIM) = {D1, D2, …, Dk}, where k is the number of data nodes; 5) CIB-class data node queue D(CIB) = {D1, D2, …, Dl}, where l is the number of data nodes;
D2-2. Data nodes in queue D(MAX) store only copies of data blocks in queue B(h). Data blocks in queues B(m) and B(n) can be stored on data nodes in queues D(CPU), D(IO) and D(CIM). Medium-heat data blocks are stored preferentially on IO-class data nodes (I/O performance is what hotter data mainly needs), then on CPU-class data nodes, and finally on CIM-class data nodes. Normal data blocks are stored preferentially on CIM-class data nodes; CPU-class and IO-class nodes are considered only when the CIM-class nodes are full. Data blocks in the cold queue B(c) can be stored only on data nodes in queue D(CIB);
D3. The placement strategy takes data correlation into account;
D3-1. Before a copy is placed, check whether correlated data blocks exist, and avoid storing two or more correlated data blocks on one node. Otherwise, when users access one of these blocks, the node may have to serve accesses to several blocks at once; correlated blocks should therefore be copied to different nodes;
D3-2. Obtain the correlation sets from step B. First query, by the BID of the data block, whether a corresponding correlation set exists. If not, skip this step. If so, record the data blocks correlated with this block and mark the node that holds the current block, so that this node is not considered when the correlated blocks are placed;
D4. To account for data locality and network transmission, most copies of a data block are stored on the same rack; to preserve cluster availability, however, two of the copies must be stored on different racks.
The invention has the advantages that the data block copy strategy based on the heterogeneous Hadoop cluster environment is provided, the cold and hot degree of the data block is measured by calculating the access frequency of each period of the data block, then the data block is placed on different data nodes according to the difference of the heat degree of the data block, the problem of the relevance of the data block is considered in the placement process, the data blocks with the relevance are placed in a scattered mode and are not stored on the same data node at the same time, the situation that a plurality of data blocks are accessed at the same time on one data node is avoided, and the load of the data node is reduced. The placing strategy provided by the invention improves the execution performance of the cluster and the utilization rate of resources. Fig. 1 details the overall flow of the present invention, and fig. 2 details the flow of the data block placement strategy.
The following are system examples corresponding to the above method examples, and this embodiment can be implemented in cooperation with the above embodiments. The related technical details mentioned in the above embodiments are still valid in this embodiment, and are not described herein again in order to reduce repetition. Accordingly, the related-art details mentioned in the present embodiment can also be applied to the above-described embodiments.
The invention also provides a data block placement system based on the heterogeneous Hadoop cluster environment, which comprises:
the module 1 divides the data blocks stored in the heterogeneous cluster environment into hot data blocks, medium-hot data blocks, normal data blocks and cold data blocks according to the access frequency of the data blocks, and classifies the data nodes in the heterogeneous cluster environment according to different performances according to the performances of the data nodes in the heterogeneous cluster environment and a preset performance standard;
the module 2 is used for analyzing the relevance of the data blocks and marking the data blocks with relevance in each classification of the data blocks;
the module 3 executes a data block placement strategy, and places each data block on data nodes of different classifications according to different performance requirements according to the classifications of the data block and the data node;
the module 4 judges whether other data blocks which are relevant to the data blocks exist in the data nodes selected and placed by the current data blocks when executing the data block placing strategy, if so, the module 3 is re-executed in the classification of the data nodes, and other data nodes are selected and placed;
and a module 5, completing the placement of the current data block, and executing the module 3 again until all the data nodes are placed.
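The interaction of modules 3 to 5 can be sketched as a simple loop. This is a minimal illustration, not the patented implementation: the function name `place_blocks`, the mapping `tier_for_class`, the preference for the least-loaded node, and the fallback used when every node in a tier already holds a correlated block are all assumptions made for the example.

```python
from collections import defaultdict


def place_blocks(block_class, node_class, correlated, tier_for_class):
    """Place each block on a node of the matching tier, avoiding correlated co-location.

    block_class:    dict block -> heat class, e.g. "hot" (output of module 1)
    node_class:     dict node -> performance tier, e.g. "high" (output of module 1)
    correlated:     set of frozenset({a, b}) pairs marked by module 2
    tier_for_class: dict heat class -> node tier (hypothetical mapping)
    """
    nodes_by_tier = defaultdict(list)
    for node, tier in node_class.items():
        nodes_by_tier[tier].append(node)

    held = defaultdict(set)  # node -> blocks already placed on it
    placement = {}
    for block, heat in block_class.items():  # module 5: repeat for every block
        candidates = nodes_by_tier[tier_for_class[heat]]  # module 3: pick the tier
        if not candidates:
            raise ValueError(f"no data node available for class {heat!r}")
        # Assumption: try the least-loaded nodes of the tier first.
        for node in sorted(candidates, key=lambda n: len(held[n])):
            conflict = any(
                frozenset((block, other)) in correlated for other in held[node]
            )
            if not conflict:  # module 4: node holds no correlated block
                break
        else:
            # Assumption: if every node in the tier conflicts, use the least loaded.
            node = min(candidates, key=lambda n: len(held[n]))
        held[node].add(block)
        placement[block] = node
    return placement
```

With three hot blocks, two high-performance nodes, and `b1` marked as correlated with `b3`, the sketch places `b3` on the node that does not hold `b1`.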
In the data block placement system based on the heterogeneous Hadoop cluster environment, module 1 comprises:
module 11, which obtains, by using a log collection tool, the number M of read operations on each data block in the heterogeneous cluster environment within a specified period T, and obtains the access frequency B_f of the current period according to the balance factor τ and the access frequency B_f(pre) of the previous period:
Figure BDA0002414041590000121
module 12, which calculates an average access frequency B_f(avg) from the access frequency of each period of the data block to measure the heat of the data block, and divides the data blocks into hot data blocks and cold data blocks in descending order of heat.
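The frequency update and heat classification of modules 11 and 12 can be sketched as follows. The formula image for B_f is not reproduced in this text, so the exponentially weighted form used in `update_access_frequency` is an assumption and may differ from the patented formula. The threshold values in `classify_heat` are likewise hypothetical and chosen only for illustration.

```python
def update_access_frequency(reads, period, tau, prev_freq):
    """Blend the current read rate M/T with the previous period's frequency.

    ASSUMED form: B_f = tau * (M / T) + (1 - tau) * B_f(pre).
    The source gives its formula only as an image, so this is a stand-in.
    """
    return tau * (reads / period) + (1.0 - tau) * prev_freq


def average_frequency(freqs):
    """Average access frequency B_f(avg) over the recorded periods."""
    return sum(freqs) / len(freqs)


def classify_heat(avg_freq, thresholds=(10.0, 5.0, 1.0)):
    """Map an average frequency to one of the four heat classes.

    The thresholds (hot, medium-hot, normal) are hypothetical values.
    """
    hot, medium, normal = thresholds
    if avg_freq >= hot:
        return "hot"
    if avg_freq >= medium:
        return "medium-hot"
    if avg_freq >= normal:
        return "normal"
    return "cold"
```

For example, 100 reads in a period of length 10 with τ = 0.5 and a previous frequency of 4.0 give an updated frequency of 7.0 under the assumed formula, which the hypothetical thresholds classify as medium-hot.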
In the data block placement system based on the heterogeneous Hadoop cluster environment, module 2 comprises:
module 21, which analyzes correlation using the covariance cov between data blocks, based on the access frequency of each data block in different periods obtained in module A:
Figure BDA0002414041590000122
where n is the number of periods, i is the current period, X and Y represent the access frequencies of data blocks B1 and B2 in the current period, and $\bar{X}$ and $\bar{Y}$ represent the average access frequencies of data blocks B1 and B2 over the n periods;
module 22, which determines whether the covariance cov is positive; if so, the access frequencies of the two data blocks follow the same trend and data blocks B1 and B2 have access correlation; otherwise, data blocks B1 and B2 do not have access correlation.
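The correlation test of modules 21 and 22 can be sketched with a plain covariance over the per-period access frequencies. The normalization by n is an assumption, since the formula image is not reproduced here. The sign test of module 22 gives the same result whether the sum is divided by n or by n - 1. The function names are hypothetical.

```python
def covariance(x, y):
    """Covariance of two equal-length access-frequency series.

    Assumes division by n (population covariance); the source formula
    is given only as an image.
    """
    if len(x) != len(y) or not x:
        raise ValueError("series must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n


def are_correlated(x, y):
    """Module 22: blocks are access-correlated when the covariance is positive."""
    return covariance(x, y) > 0
```

Two blocks whose frequencies rise together over three periods, such as [1, 2, 3] and [2, 4, 6], are marked as correlated. Series that move in opposite directions, such as [1, 2, 3] and [6, 4, 2], are not.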

Claims (6)

Translated from Chinese
1. A data block placement method based on a heterogeneous Hadoop cluster environment, characterized by comprising:
Step 1: according to the frequency with which data blocks are accessed, dividing the data blocks stored in the heterogeneous cluster environment into hot data blocks, medium-hot data blocks, normal data blocks and cold data blocks, and, according to the performance of each data node in the heterogeneous cluster environment and a preset performance standard, classifying the data nodes in the heterogeneous cluster environment by performance;
Step 2: performing data block correlation analysis, and marking the correlated data blocks within each data block classification;
Step 3: executing a data block placement strategy and, according to the classifications of the data blocks and the data nodes, placing each data block on a data node of the classification that matches its performance requirement;
Step 4: when executing the data block placement strategy, determining whether the data node selected for the current data block already holds other data blocks correlated with it and, if so, re-executing step 3 within the classification of that data node to select another data node for placement;
Step 5: completing the placement of the current data block, and executing step 3 again until all the data nodes have completed placement.
2. The data block placement method based on a heterogeneous Hadoop cluster environment according to claim 1, characterized in that step 1 comprises:
Step 11: obtaining, through a log collection tool, the number M of read operations on each data block in the heterogeneous cluster environment within a specified period T, and obtaining the access frequency B_f of the current period according to the balance factor τ and the access frequency B_f(pre) of the previous period:
Figure FDA0002414041580000011
Step 12: calculating the average access frequency B_F(avg) from the access frequency of each period of the data block to measure the heat of the data block, and dividing the data blocks into hot data blocks and cold data blocks in descending order of heat.
3. The data block placement method based on a heterogeneous Hadoop cluster environment according to claim 2, characterized in that step 2 comprises:
Step 21: based on the access frequency of each data block in different periods obtained in step A, analyzing correlation using the covariance cov between data blocks:
Figure FDA0002414041580000021
where n is the number of periods, i is the current period, X and Y represent the access frequencies of data blocks B1 and B2 in the current period, and $\bar{X}$ and $\bar{Y}$ represent the average access frequencies of data blocks B1 and B2 over the n periods;
Step 22: determining whether the covariance cov is positive; if so, the access frequencies of the two data blocks follow the same trend and data blocks B1 and B2 have access correlation; otherwise, data blocks B1 and B2 do not have access correlation.
4. A data block placement system based on a heterogeneous Hadoop cluster environment, characterized by comprising:
Module 1: according to the frequency with which data blocks are accessed, dividing the data blocks stored in the heterogeneous cluster environment into hot data blocks, medium-hot data blocks, normal data blocks and cold data blocks, and, according to the performance of each data node in the heterogeneous cluster environment and a preset performance standard, classifying the data nodes in the heterogeneous cluster environment by performance;
Module 2: performing data block correlation analysis, and marking the correlated data blocks within each data block classification;
Module 3: executing a data block placement strategy and, according to the classifications of the data blocks and the data nodes, placing each data block on a data node of the classification that matches its performance requirement;
Module 4: when executing the data block placement strategy, determining whether the data node selected for the current data block already holds other data blocks correlated with it and, if so, re-executing module 3 within the classification of that data node to select another data node for placement;
Module 5: completing the placement of the current data block, and executing module 3 again until all the data nodes have completed placement.
5. The data block placement system based on a heterogeneous Hadoop cluster environment according to claim 1, characterized in that module 1 comprises:
Module 11: obtaining, through a log collection tool, the number M of read operations on each data block in the heterogeneous cluster environment within a specified period T, and obtaining the access frequency B_f of the current period according to the balance factor τ and the access frequency B_f(pre) of the previous period:
Figure FDA0002414041580000023
Module 12: calculating the average access frequency B_F(avg) from the access frequency of each period of the data block to measure the heat of the data block, and dividing the data blocks into hot data blocks and cold data blocks in descending order of heat.
6. The data block placement system based on a heterogeneous Hadoop cluster environment according to claim 2, characterized in that module 2 comprises:
Module 21: based on the access frequency of each data block in different periods obtained in module A, analyzing correlation using the covariance cov between data blocks:
Figure FDA0002414041580000031
where n is the number of periods, i is the current period, X and Y represent the access frequencies of data blocks B1 and B2 in the current period, and $\bar{X}$ and $\bar{Y}$ represent the average access frequencies of data blocks B1 and B2 over the n periods;
Module 22: determining whether the covariance cov is positive; if so, the access frequencies of the two data blocks follow the same trend and data blocks B1 and B2 have access correlation; otherwise, data blocks B1 and B2 do not have access correlation.
CN202010185518.2A | 2020-03-17 | 2020-03-17 | Data block placement method and system based on heterogeneous Hadoop cluster environment | Active | CN113407620B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010185518.2A, CN113407620B (en) | 2020-03-17 | 2020-03-17 | Data block placement method and system based on heterogeneous Hadoop cluster environment

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202010185518.2A, CN113407620B (en) | 2020-03-17 | 2020-03-17 | Data block placement method and system based on heterogeneous Hadoop cluster environment

Publications (2)

Publication Number | Publication Date
CN113407620A (en) | 2021-09-17
CN113407620B (en) | 2023-04-21

Family

ID=77677033

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN202010185518.2A | Active | CN113407620B (en) | 2020-03-17 | 2020-03-17 | Data block placement method and system based on heterogeneous Hadoop cluster environment

Country Status (1)

Country | Link
CN (1) | CN113407620B (en)

Also Published As

Publication number | Publication date
CN113407620B (en) | 2023-04-21

Legal Events

Code | Title | Description
PB01 | Publication | Publication
SE01 | Entry into force of request for substantive examination | Entry into force of request for substantive examination
GR01 | Patent grant | Patent grant
TR01 | Transfer of patent right | Effective date of registration: 20240329; Address after: 100095, 2nd Floor, Building 1, Baijiatong Shangpingyuan, Haidian District, Beijing, 20218; Patentee after: Beijing United Power Cultural Media Co.,Ltd.; Country or region after: China; Address before: 100101 12 Xiaoying East Road, Qinghe, Haidian District, Beijing; Patentee before: BEIJING INFORMATION SCIENCE AND TECHNOLOGY University; Country or region before: China
TR01 | Transfer of patent right | Effective date of registration: 20240708; Address after: Room 309, 3rd Floor, Building D, No. 2-2, Beijing Shichuang High tech Development Corporation, 2 Shangdi Information Road, Haidian District, Beijing 100088; Patentee after: Best Innovation (Beijing) Technology Co.,Ltd.; Country or region after: China; Address before: 100095, 2nd Floor, Building 1, Baijiatong Shangpingyuan, Haidian District, Beijing, 20218; Patentee before: Beijing United Power Cultural Media Co.,Ltd.; Country or region before: China
