Technical Field

The present invention relates to the field of network communication technologies, and in particular to a data processing method and device.
Background

Ceph is a distributed storage system with excellent performance, high reliability, and high scalability, and is widely used in storage environments of all sizes. The addressing process of a Ceph system is completed through three mappings: from File to object, from object to PG (Placement Group), and from PG to OSD (Object Storage Device). The mapping from object to PG is implemented with a hash algorithm, while the mapping from PG to OSD is completed with the CRUSH (Controlled Replication Under Scalable Hashing) algorithm.

The failure domain is another important concept in a Ceph cluster. By introducing failure domains and combining them with the redundancy policy, the cluster can preferentially place the replicas or shards of a piece of data in different failure domains, so that when a single failure domain fails, storage services can still be provided normally.

However, it has been found in practice that when the number of replicas specified by the data redundancy policy equals the number of bucket nodes designated as failure domains, PGs are evenly mapped onto each bucket node designated as a failure domain. When the storage capacities of different bucket nodes differ greatly, the bucket node with the smaller storage capacity soon becomes a capacity bottleneck, resulting in a huge difference in storage usage among the bucket nodes of the Ceph cluster. When the storage usage of some OSD reaches its upper limit (for example, 95%), the Ceph cluster can no longer provide external services, while other large-capacity nodes still have considerable spare capacity, so the storage utilization of the whole Ceph cluster remains low.
Summary

The present invention provides a data processing method and device, to solve the problem in existing Ceph clusters that, when the number of replicas specified by the data redundancy policy equals the number of bucket nodes designated as failure domains and the storage capacities of the bucket nodes differ greatly, the bucket node with the smaller storage capacity soon becomes a storage capacity bottleneck.
According to a first aspect of the embodiments of the present invention, a data processing method is provided, applied to a monitor of a distributed storage system Ceph cluster. The method includes:

for any storage pool (pool) of the Ceph cluster, when the number of replicas of the pool equals the number of bucket nodes designated as failure domains, adding the pool to the first-type pool group corresponding to these bucket nodes;

for any first-type pool group, when the first theoretical storage usage of the first-type pool group is less than a preset usage threshold, triggering storage unit migration between the bucket nodes corresponding to the first-type pool group, so that the second theoretical storage usage of the first-type pool group after migration is greater than the first theoretical storage usage, and updating the CRUSH (Controlled Replication Under Scalable Hashing) map corresponding to the first-type pool group after the migration is completed.
According to a second aspect of the embodiments of the present invention, a data processing device is provided, applied to a monitor of a distributed storage system Ceph cluster. The device includes:

a pool group management unit, configured to, for any storage pool (pool) of the Ceph cluster, when the number of replicas of the pool equals the number of bucket nodes designated as failure domains, add the pool to the first-type pool group corresponding to these bucket nodes;

a migration unit, configured to, for any first-type pool group, when the first theoretical storage usage of the first-type pool group is less than a preset usage threshold, trigger storage unit migration between the bucket nodes corresponding to the first-type pool group, so that the second theoretical storage usage of the first-type pool group after migration is greater than the first theoretical storage usage; and

a maintenance unit, configured to update the CRUSH map corresponding to the first-type pool group after the migration is completed.

By applying the embodiments of the present invention, for any pool of the Ceph cluster, when the number of replicas of the pool equals the number of bucket nodes designated as failure domains, the pool is added to the first-type pool group corresponding to these bucket nodes; for any first-type pool group, when the first theoretical storage usage of the first-type pool group is less than the preset usage threshold, storage unit migration between the bucket nodes corresponding to the first-type pool group is triggered, so that the second theoretical storage usage of the first-type pool group after migration is greater than the first theoretical storage usage, and the CRUSH map corresponding to the first-type pool group is updated after the migration is completed. This raises the theoretical storage usage of the first-type pool group, and can thus improve the storage utilization of the Ceph cluster.
Brief Description of the Drawings

Fig. 1 is a schematic flowchart of a data processing method provided by an embodiment of the present invention;

Figs. 2A to 2B are schematic diagrams of the weights of bucket nodes in a specific application scenario provided by an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a data processing device provided by an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a data processing device provided by an embodiment of the present invention;

Fig. 5 is a schematic diagram of the hardware structure of a data processing device provided by an embodiment of the present invention.
Detailed Description

To enable those skilled in the art to better understand the technical solutions in the embodiments of the present invention, some terms and concepts involved in the present invention are briefly explained below.
1. Pool (storage pool): a pool is a collection of PGs. When a pool is created, a data redundancy policy must be configured and the bucket nodes serving as failure domains must be specified; the data redundancy policy specifies the number of replicas of the pool. After the pool is created, if the number of replicas specified by the pool's data redundancy policy equals the number of bucket nodes designated as the pool's failure domains, the PGs added to the pool are evenly mapped onto each bucket node designated as a failure domain.

It should be noted that the number of replicas specified by the pool's data redundancy policy may also differ from the number of bucket nodes designated as failure domains. In that case, according to the CRUSH algorithm, PGs are still mapped onto the bucket nodes designated as failure domains in a relatively uniform manner.
2. First-type pool group: also referred to herein as an equal-count bucket group. A pool whose number of replicas specified by the data redundancy policy equals the number of bucket nodes designated as failure domains can be added to a first-type pool group. A first-type pool group is created based on bucket nodes: bucket nodes of the same type correspond to the same first-type pool group, and bucket nodes of different types correspond to different first-type pool groups. When multiple pools take bucket nodes of the same type as their failure domains, and the number of replicas specified by the data redundancy policies of these pools equals the number of bucket nodes of that type, these pools must join the same first-type pool group.

The types of bucket nodes may include, but are not limited to, server, rack, equipment room, data center, and the like.
For the multiple bucket nodes corresponding to a first-type pool group, the storage units (such as OSDs) within each bucket node are logically allowed to migrate between bucket nodes, without changing the actual physical deployment.

For example, if OSD1 on bucket A is migrated to bucket B, OSD1 logically belongs to bucket B but actually still belongs to (i.e., physically belongs to) bucket A.
3. Second-type pool group: also referred to herein as the non-equal-count group. The whole Ceph cluster maintains one and only one non-equal-count group, and each pool in the Ceph cluster initially belongs to the second-type pool group by default.

4. Weight: the storage capacity of a bucket node is also called the weight of the bucket node. For example, if the weight of a bucket node with a storage capacity of 1 TB is defined as 1, then the weight of a bucket node with a storage capacity of 100 TB is 100, and the weight of a bucket node with a storage capacity of 500 GB is 0.5.
5. Theoretical storage usage of a first-type pool group: among the multiple bucket nodes corresponding to the first-type pool group, the ratio of the weight of the bucket node with the smallest weight to the average weight of these bucket nodes.

For example, if the bucket nodes corresponding to a first-type pool group include bucket A, bucket B, and bucket C, with weights of 30, 30, and 18 respectively, the theoretical storage usage of the first-type pool group is about 70% (18 / [(30 + 30 + 18) / 3] ≈ 70%).
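The definition and the worked example above can be checked with a short sketch; `theoretical_usage` is an illustrative helper name, not part of Ceph.

```python
def theoretical_usage(weights):
    """Theoretical storage usage of a first-type pool group:
    the smallest bucket weight divided by the average bucket weight."""
    if not weights:
        raise ValueError("at least one bucket weight is required")
    return min(weights) / (sum(weights) / len(weights))

# Example from the text: buckets A, B, C with weights 30, 30, 18.
print(round(theoretical_usage([30, 30, 18]), 2))  # 0.69, i.e. roughly 70%
```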
6. Real OSD: an OSD whose logical attribution and physical attribution are the same bucket node.

7. Virtual OSD: an OSD whose logical attribution and physical attribution are different bucket nodes.

For example, if OSD1 in bucket A is migrated to bucket B, then OSD1 logically belongs to bucket B while physically belonging to bucket A, i.e., OSD1 is a virtual OSD; OSD2 on bucket A, which has not been migrated, logically belongs to bucket A and also physically belongs to bucket A, i.e., OSD2 is a real OSD.
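The real/virtual distinction can be sketched as follows; the `OSD` class and its fields are illustrative assumptions, not actual Ceph data structures.

```python
from dataclasses import dataclass

@dataclass
class OSD:
    name: str
    logical_bucket: str   # bucket the OSD belongs to after (logical) migration
    physical_bucket: str  # bucket where the OSD is actually deployed

    @property
    def is_virtual(self) -> bool:
        # A virtual OSD logically and physically belongs to different buckets.
        return self.logical_bucket != self.physical_bucket

osd1 = OSD("OSD1", logical_bucket="bucket B", physical_bucket="bucket A")
osd2 = OSD("OSD2", logical_bucket="bucket A", physical_bucket="bucket A")
print(osd1.is_virtual, osd2.is_virtual)  # True False
```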
To make the above objects, features, and advantages of the embodiments of the present invention more comprehensible, the technical solutions in the embodiments of the present invention are further described in detail below with reference to the accompanying drawings.

Referring to Fig. 1, which is a schematic flowchart of a data processing method provided by an embodiment of the present invention, the data processing method can be applied to a monitor of a Ceph cluster. As shown in Fig. 1, the data processing method may include the following steps.
Step 101: for any pool of the Ceph cluster, when the number of replicas of the pool equals the number of bucket nodes designated as failure domains, add the pool to the first-type pool group corresponding to these bucket nodes.

In the embodiment of the present invention, for any pool in the Ceph cluster, the monitor may determine whether the number of replicas of the pool (i.e., the number of replicas specified by the pool's data redundancy policy) equals the number of bucket nodes designated as failure domains.

For example, when a pool is newly created in the Ceph cluster, the monitor may determine whether the number of replicas of the pool equals the number of bucket nodes designated as failure domains.

In the embodiment of the present invention, when the monitor determines that the number of replicas of the pool equals the number of bucket nodes designated as failure domains, the monitor may add the pool to the first-type pool group corresponding to these bucket nodes.
In one embodiment of the present invention, adding the pool to the first-type pool group corresponding to the bucket nodes includes:

determining whether a first-type pool group corresponding to the bucket nodes exists;

if it exists, adding the pool to the first-type pool group corresponding to the bucket nodes;

otherwise, creating the first-type pool group corresponding to the bucket nodes, adding the pool to that first-type pool group, and creating a CRUSH map corresponding to the first-type pool group.
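The join-or-create steps above can be sketched as follows; the dictionary-based grouping and the function name are assumptions for illustration, not the monitor's actual implementation.

```python
def add_pool_to_first_type_group(groups, pool, replica_count, failure_domain_buckets):
    """groups maps a failure-domain bucket set to its list of member pools.
    Returns the key of the group the pool joined, or None when the replica
    count does not equal the number of failure-domain buckets."""
    if replica_count != len(failure_domain_buckets):
        return None  # pool stays in the default (second-type) group
    key = tuple(sorted(failure_domain_buckets))
    if key not in groups:
        groups[key] = []  # create the group (and, in Ceph, its CRUSH map)
    groups[key].append(pool)
    return key

groups = {}
print(add_pool_to_first_type_group(groups, "pool1", 3, ["A", "B", "C"]))  # ('A', 'B', 'C')
print(add_pool_to_first_type_group(groups, "pool2", 2, ["A", "B", "C"]))  # None
```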
In this embodiment, when the monitor determines that the number of replicas of the pool equals the number of bucket nodes designated as failure domains, the monitor determines that the pool needs to be added to the first-type pool group corresponding to these bucket nodes (i.e., the bucket nodes designated as the pool's failure domains). At this point, the monitor may first determine whether a first-type pool group corresponding to the bucket nodes exists; if it exists, the monitor may directly add the pool to that first-type pool group; otherwise, the monitor may create the first-type pool group corresponding to the bucket nodes, add the pool to it, and create a CRUSH map corresponding to the first-type pool group. For the specific implementation, reference may be made to the related implementation of creating a CRUSH map in an existing Ceph cluster, which is not described in detail here.
Step 102: for any first-type pool group, when the first theoretical storage usage of the first-type pool group is less than the preset usage threshold, trigger storage unit migration between the bucket nodes corresponding to the first-type pool group, so that the second theoretical storage usage of the first-type pool group after migration is greater than the first theoretical storage usage, and update the CRUSH map corresponding to the first-type pool group after the migration is completed.

In the embodiment of the present invention, considering the case where the number of replicas equals the number of bucket nodes designated as failure domains, when the storage capacities of these bucket nodes differ greatly, the bucket node with the smaller storage capacity becomes a storage capacity bottleneck. Therefore, by migrating storage units among the multiple bucket nodes designated as failure domains, the storage capacities of the bucket nodes can be better balanced and the storage utilization improved.

Accordingly, in the embodiment of the present invention, for any first-type pool group, the monitor may calculate the theoretical storage usage of the first-type pool group (referred to herein as the first theoretical storage usage) and determine whether it is less than the preset usage threshold. If it is, the storage usage of each bucket node corresponding to the first-type pool group will be low, resulting in wasted storage resources; storage unit migration between the bucket nodes corresponding to the first-type pool group is therefore triggered.

The preset usage threshold may be set according to the storage usage acceptable in the actual scenario; for example, it may be set to 80%, 90%, and so on.

In the embodiment of the present invention, when the monitor migrates storage units among the bucket nodes corresponding to the first-type pool group, it needs to ensure that the theoretical storage usage of the first-type pool group after migration (referred to herein as the second theoretical storage usage) is greater than the first theoretical storage usage; that is, the migration of storage units raises the theoretical storage usage of the first-type pool group.
For example, a storage unit of a certain capacity may be migrated out of a bucket node with a larger weight and into a bucket node with a smaller weight; alternatively, a storage unit with capacity M1 may be migrated from a bucket node with a larger weight to a bucket node with a smaller weight, while a storage unit with capacity M2 is migrated from the bucket node with the smaller weight to the bucket node with the larger weight, where M2 is smaller than M1.

After storage unit migration, the storage capacity (i.e., weight) of each bucket node = the initial storage capacity of the bucket node + the storage capacity of the storage units migrated in - the storage capacity of the storage units migrated out.

For example, if the bucket nodes corresponding to a first-type pool group include bucket A, bucket B, and bucket C, with weights of 30, 30, and 18 respectively, and the monitor controls bucket A and bucket B to each migrate 2 TB of storage units to bucket C, then after the migration is completed, the weights of bucket A and bucket B become 28, the weight of bucket C becomes 22, and the theoretical storage usage of the first-type pool group becomes about 85% (22 / [(28 + 28 + 22) / 3] ≈ 85%).
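The weight bookkeeping in this example can be reproduced with a small sketch; the helper names are illustrative, not Ceph code.

```python
def apply_migration(weights, moves):
    """weights: bucket name -> weight (capacity); moves: (src, dst, capacity).
    New weight = initial weight + capacity migrated in - capacity migrated out."""
    w = dict(weights)
    for src, dst, capacity in moves:
        w[src] -= capacity
        w[dst] += capacity
    return w

def theoretical_usage(weights):
    """Smallest bucket weight divided by the average bucket weight."""
    return min(weights) / (sum(weights) / len(weights))

before = {"A": 30, "B": 30, "C": 18}
after = apply_migration(before, [("A", "C", 2), ("B", "C", 2)])
print(after)                                            # {'A': 28, 'B': 28, 'C': 22}
print(round(theoretical_usage(list(after.values())), 2))  # 0.85
```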
In the embodiment of the present invention, after the storage unit migration between bucket nodes is completed, the monitor may update the CRUSH map corresponding to the first-type pool group. For the specific implementation, reference may be made to the CRUSH map update performed when storage units are added to or removed from an existing Ceph cluster, which is not described in detail here.
Further, in the embodiment of the present invention, the monitor needs to maintain the member list of each first-type pool group, i.e., to record the correspondence between each first-type pool group and the pools belonging to it, and to deliver the member list of the first-type pool group (including the correspondence between pools and first-type pool groups) to the Ceph cluster nodes (i.e., the server nodes in the Ceph cluster) and the clients, so that a client performs data read/write processing according to the correspondence between pools and first-type pool groups.

Accordingly, when a client receives a data read/write request for a target pool, it may query the correspondence between pools and first-type pool groups recorded by itself, to determine whether a target first-type pool group corresponding to the target pool exists (i.e., whether the target pool belongs to a first-type pool group). If the target first-type pool group corresponding to the target pool is found, the client may determine, according to the CRUSH map corresponding to the target first-type pool group, the target OSD hit by the read/write request, and perform data read/write processing on the target OSD.
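A client-side lookup along these lines might look like the following sketch; `pool_to_group` and the map parameters are hypothetical names introduced for illustration.

```python
def select_crush_map(pool, pool_to_group, group_crush_maps, default_crush_map):
    """Pick the CRUSH map a client should use for a read/write request:
    the map of the target first-type pool group if the pool belongs to one,
    otherwise the cluster's default map (second-type pool group)."""
    group = pool_to_group.get(pool)
    if group is not None:
        return group_crush_maps[group]
    return default_crush_map

group_maps = {"group1": "crush-map-group1"}
membership = {"pool1": "group1"}
print(select_crush_map("pool1", membership, group_maps, "crush-map-default"))  # crush-map-group1
print(select_crush_map("pool9", membership, group_maps, "crush-map-default"))  # crush-map-default
```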
It should be noted that, in the embodiment of the present invention, the CRUSH map corresponding to a first-type pool group can be obtained by extending the existing CRUSH map, i.e., by adding a type field to the existing CRUSH map to identify the first-type pool group to which it belongs. The type field is consistent with the type of the bucket nodes corresponding to the first-type pool group; for example, when the bucket node type is equipment room, the type field in the CRUSH map may be type1; when the bucket node type is rack, the type field in the CRUSH map may be type2.

It can be seen that, in the method flow shown in Fig. 1, by setting up first-type pool groups, qualifying pools in the Ceph cluster are added to the corresponding first-type pool groups, and when the theoretical storage usage of a first-type pool group is less than the preset usage threshold, storage unit migration is performed on the bucket nodes corresponding to that first-type pool group, so as to raise the theoretical storage usage of the first-type pool group and thus improve the storage utilization of the Ceph cluster.
In one embodiment of the present invention, triggering storage unit migration between the bucket nodes corresponding to the first-type pool group includes:

migrating storage units between the bucket nodes corresponding to the first-type pool group on the principle that the second theoretical storage usage is greater than or equal to the preset usage threshold;

when no migration scheme makes the second theoretical storage usage greater than or equal to the preset usage threshold, migrating storage units between the bucket nodes corresponding to the first-type pool group on the principle that the absolute value of the difference between the second theoretical storage usage and the preset usage threshold is minimized; and

when there are multiple migration schemes that make the second theoretical storage usage greater than or equal to the preset usage threshold, or multiple migration schemes that minimize the absolute value of the difference between the second theoretical storage usage and the preset usage threshold, determining the migration scheme actually used according to one or more of the following principles:

a migration scheme that migrates out fewer storage units is preferred; a migration scheme in which more bucket nodes have storage units migrated out is preferred; a migration scheme in which the storage units migrated out of the same bucket node go to a more concentrated set of destinations is preferred.
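The three preference principles can be combined into a single ordering, for example as a sort key; the plan fields below are assumptions introduced for illustration.

```python
def pick_migration_plan(plans):
    """Each plan is a dict with:
      'moved_units'      - number of storage units migrated out (fewer is better),
      'source_buckets'   - number of buckets that migrate units out (more is better),
      'max_destinations' - most destinations any single source bucket sends units
                           to (fewer means more concentrated, which is better)."""
    return min(plans, key=lambda p: (p["moved_units"],
                                     -p["source_buckets"],
                                     p["max_destinations"]))

plans = [
    {"name": "plan1", "moved_units": 3, "source_buckets": 2, "max_destinations": 1},
    {"name": "plan2", "moved_units": 2, "source_buckets": 1, "max_destinations": 2},
    {"name": "plan3", "moved_units": 2, "source_buckets": 2, "max_destinations": 1},
]
print(pick_migration_plan(plans)["name"])  # plan3
```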
In this embodiment, to optimize the effect of storage unit migration, it should be ensured as far as possible that the second theoretical storage usage is greater than or equal to the preset usage threshold.

However, considering that in actual scenarios migration is performed in units of storage units, and the storage capacity of a storage unit is not arbitrarily variable, in some scenarios it may be impossible to make the second theoretical storage usage greater than or equal to the preset usage threshold through storage unit migration.

For example, suppose the bucket nodes designated as failure domains corresponding to a first-type pool group include bucket A and bucket B, with weights of 3 and 1.8 respectively (i.e., storage capacities of 3 TB and 1.8 TB); the storage units in bucket A each have a capacity of 1 TB (i.e., three 1 TB storage units), and the storage units in bucket B have capacities of 1 TB and 0.8 TB (i.e., one 1 TB storage unit and one 0.8 TB storage unit). Then the highest theoretical storage usage that the first-type pool group can reach through storage unit migration is about 83% (2 / [(2.8 + 2) / 2] ≈ 83%), achieved by migrating one 1 TB storage unit from bucket A to bucket B and one 0.8 TB storage unit from bucket B to bucket A. Therefore, in this scenario, when the preset usage threshold is higher than 83%, e.g., 85% or 90%, the second theoretical storage usage cannot be made greater than or equal to the preset usage threshold through storage unit migration.
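The 83% bound in this example can be verified by brute force over single-unit exchanges between the two buckets; this is an illustrative search, not the monitor's actual algorithm.

```python
def usage(weights):
    """Smallest weight divided by the average weight."""
    return min(weights) / (sum(weights) / len(weights))

def best_exchange_usage(units_a, units_b):
    """Best theoretical usage reachable by moving at most one unit A->B and
    at most one unit B->A (unit capacities are fixed, not divisible)."""
    base_a, base_b = sum(units_a), sum(units_b)
    candidates = [(0.0, 0.0)]                               # no migration
    candidates += [(a, 0.0) for a in units_a]               # one unit A->B
    candidates += [(0.0, b) for b in units_b]               # one unit B->A
    candidates += [(a, b) for a in units_a for b in units_b]  # swap one each way
    return max(usage([base_a - a + b, base_b + a - b]) for a, b in candidates)

# bucket A: three 1 TB units; bucket B: one 1 TB unit and one 0.8 TB unit.
print(round(best_exchange_usage([1.0, 1.0, 1.0], [1.0, 0.8]), 2))  # 0.83
```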
在该实施例中,当不存在使第二理论存储使用率大于等于预设使用率阈值的迁移方案时,可以通过迁移存储单元使第二理论存储使用率尽量接近预设存储使用率阈值。In this embodiment, when there is no migration scheme that makes the second theoretical storage usage rate greater than or equal to the preset usage rate threshold, the second theoretical storage usage rate may be as close to the preset storage usage rate threshold as possible by migrating storage units.
Further, in this embodiment, when there are multiple migration schemes that make the second theoretical storage usage rate greater than or equal to the preset usage rate threshold, or multiple migration schemes that minimize the absolute value of the difference between the second theoretical storage usage rate and the preset usage rate threshold, the monitor may determine the migration scheme actually used according to one or more of the following principles: a migration scheme that migrates out fewer storage units is preferred; a migration scheme in which more bucket nodes have storage units migrated out is preferred; and/or a migration scheme in which the storage units migrated out of the same bucket node are more concentrated in their destinations is preferred.
For example, assume there are multiple migration schemes that make the second theoretical storage usage rate greater than or equal to the preset usage rate threshold. The monitor may compare the number of storage units each scheme needs to migrate out, and determine the scheme with the fewest migrated-out storage units as the scheme actually used. If the schemes migrate out the same number of storage units, the monitor may further compare the number of bucket nodes that have storage units migrated out in each scheme, and determine the scheme with the most such bucket nodes as the scheme actually used. If that number is also the same across schemes, the monitor determines the scheme in which the storage units migrated out of the same bucket node are most concentrated as the scheme actually used.
For example, if migration scheme 1 needs to migrate bucket A's storage units to both bucket B and bucket C, while migration scheme 2 only needs to migrate bucket A's storage units to bucket C, then in migration scheme 2 the storage units migrated out of the same bucket node are more concentrated.
In this embodiment, if the above principles still fail to single out the migration scheme actually used, one scheme may be selected from the remaining candidates, or the choice may be further narrowed according to other strategies; the specific implementation is not detailed here.
It should be recognized that the above principles for determining the migration scheme are merely several specific examples of embodiments of the present invention, and do not limit the protection scope of the present invention; that is, embodiments of the present invention may also determine the migration scheme actually used according to other principles, such as preferring a migration scheme in which fewer bucket nodes have storage units migrated out, or selecting at random; the specific implementation is not detailed here.
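The tie-breaking order described above can be sketched as a sorting key over candidate schemes. This is an illustrative Python sketch only: the `(src, dst, n_units)` move representation and all names are assumptions for illustration, not part of the patent text.

```python
def scheme_key(moves):
    """Ranking key for a candidate migration scheme.

    `moves` is a list of (src_bucket, dst_bucket, n_units) tuples.
    Lower keys sort first: fewer migrated-out storage units, then more
    distinct source buckets, then fewer (src, dst) pairs, i.e. the units
    leaving one bucket land on fewer destinations (more concentrated).
    """
    total_units = sum(n for _, _, n in moves)
    n_sources = len({src for src, _, _ in moves})
    n_pairs = len({(src, dst) for src, dst, _ in moves})
    return (total_units, -n_sources, n_pairs)

# The two example schemes from the text: both move 4 units out of bucket A,
# but scheme 2 sends them all to one destination, so it wins the third tie-break.
schemes = {
    "plan1": [("A", "B", 2), ("A", "C", 2)],
    "plan2": [("A", "C", 4)],
}
best = min(schemes, key=lambda name: scheme_key(schemes[name]))
print(best)  # -> plan2
```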
Further, in one embodiment of the present invention, after triggering the storage unit migration between the bucket nodes corresponding to the first-type pool group, the method may further include:
when an operation that adds or removes storage units on a bucket node corresponding to the first-type pool group is detected, determining the third theoretical storage usage rate that the first-type pool group would have after the add/remove operation if no storage unit migration had been performed;
when the third theoretical storage usage rate is greater than the second theoretical storage usage rate, deleting the first-type pool group, adding each pool of the first-type pool group to the second-type pool group, and re-determining the mapping relationship between the PGs and OSDs of each pool of the original first-type pool group that has joined the second-type pool group.
In this embodiment, it is considered that an administrator may improve the storage usage rate by adding storage units to a bucket node with a smaller weight, or removing storage units from a bucket node with a larger weight. In that case, the above storage unit migration may instead reduce the storage usage rate. Therefore, after the monitor has performed storage unit migration for a first-type pool group, if the monitor detects an operation that adds or removes storage units on a bucket node corresponding to that group, the monitor may determine the theoretical storage usage rate that the group would have after the add/remove operation if no storage unit migration had been performed (referred to herein as the third theoretical storage usage rate).
For example, assume the bucket nodes corresponding to a first-type pool group are bucket A, bucket B and bucket C, with weights of 30, 30 and 18 respectively, and that the monitor has performed storage unit migration between the bucket nodes in the manner described above. At some later moment, the administrator adds 9T of storage units to bucket C (so bucket C's actual weight, ignoring the storage unit migration, becomes 27). The monitor then determines that the third theoretical storage usage rate after this operation, without the storage unit migration, would be 93% (27/[(30+30+27)/3]=93%).
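The theoretical storage usage rate used in the worked figures can be sketched as the smallest bucket weight divided by the mean bucket weight. This interpretation is inferred from the arithmetic in the text, and the function name is illustrative:

```python
def theoretical_usage_rate(weights):
    """Smallest fault-domain bucket weight over the mean bucket weight.

    Reproduces the worked figures in the text, e.g.
    27 / ((30 + 30 + 27) / 3) is approximately 0.93.
    """
    return min(weights) / (sum(weights) / len(weights))

# Before the 9T of storage units is added to bucket C (weights 30, 30, 18):
print(round(theoretical_usage_rate([30, 30, 18]), 2))  # -> 0.69 (~70% in the text)
# After bucket C grows to weight 27:
print(round(theoretical_usage_rate([30, 30, 27]), 2))  # -> 0.93
```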
In this embodiment, when the third theoretical storage usage rate is greater than the second theoretical storage usage rate, the monitor may delete the first-type pool group, add each pool of the group to the second-type pool group, and re-determine the PG-to-OSD mapping of each pool of the original first-type pool group that has joined the second-type pool group, i.e., adjust the PG-to-OSD mapping through the CRUSH algorithm according to the CRUSH map corresponding to the second-type pool group.
For example, assume the original first-type pool group includes pool1 and pool2. When the first-type pool group is deleted and its pools (pool1 and pool2) are added to the second-type pool group, the PG-to-OSD mapping of each of those pools (pool1 and pool2) needs to be re-determined.
Further, in embodiments of the present invention, when the number of bucket nodes corresponding to a first-type pool group increases, the member pools of the group no longer satisfy the condition that the number of replicas equals the number of bucket nodes designated as fault domains. In this case, the monitor also needs to delete the first-type pool group, add its pools to the second-type pool group, and re-determine the mapping relationship between the placement groups (PGs) and object storage devices (OSDs) of each pool of the original first-type pool group that has joined the second-type pool group. The specific implementation may refer to the corresponding processing when storage units are added or removed, and is not repeated here.
Further, in embodiments of the present invention, it is considered that, after storage unit migration between the bucket nodes corresponding to a first-type pool group, the multiple OSDs corresponding to a PG in the group's member pools may physically belong to the same bucket node. For example, in a 3-replica scenario, a PG is mapped to a real OSD of bucket A, a virtual OSD of bucket B and a virtual OSD of bucket C, and the virtual OSDs mapped on bucket B and bucket C were both migrated out of bucket A, i.e., both physically belong to bucket A. In this case, all three OSDs mapped by the PG belong to bucket A, and there is a single-point-of-failure risk: when bucket A fails, the PG's data cannot be recovered.
Correspondingly, in one embodiment of the present invention, after triggering the storage unit migration between the bucket nodes corresponding to the first-type pool group, the method further includes:
for any PG in the first-type pool group, when all OSDs corresponding to the PG physically belong to the same bucket node, selecting one virtual OSD from the OSDs corresponding to the PG, deleting the mapping relationship between the PG and that virtual OSD, and recalculating an OSD within the bucket node to which that virtual OSD logically belongs, such that the recalculated OSD and the other OSDs corresponding to the PG physically belong to different bucket nodes.
In this embodiment, for any PG in the first-type pool group, after the OSDs mapped by the PG are calculated through the CRUSH algorithm, the number of virtual OSDs among them may be counted. When the number of virtual OSDs equals (the number of bucket nodes − 1), the virtual OSD information needs to be further read to determine whether the bucket nodes to which the virtual OSDs physically belong are the same as the bucket node to which the only real OSD physically belongs. When all virtual OSDs physically belong to the same bucket node as the real OSD, one virtual OSD may be selected from the virtual OSDs mapped by the PG, for example, the virtual OSD whose logically-owning bucket node has the smallest ID (identifier). The mapping relationship between the PG and that virtual OSD is determined to be invalid and deleted, the parameter r is re-selected and an OSD is recalculated within the bucket node to which the virtual OSD logically belongs. If the recalculated OSD physically belongs to the same bucket node as the other OSDs mapped by the PG, the parameter r needs to be re-selected again, until an OSD that physically belongs to a different bucket node from the PG's other OSDs is calculated, or the number of recalculations reaches a preset upper limit.
The CRUSH algorithm calculates the distribution of data objects through the weights of the storage devices; during the calculation, the final storage location of a data object is determined by the cluster map, the data distribution strategy and a random number. The above parameter r refers to this random number.
When the number of recalculations reaches the preset upper limit and no OSD that physically belongs to a different bucket node from the PG's other OSDs has been calculated, one of the OSDs calculated in those attempts may be selected as the OSD finally used; for example, the last calculated OSD may be selected.
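The retry loop described above can be sketched as follows. Here `choose(r)` stands in for one CRUSH draw with random parameter r inside the bucket the virtual OSD logically belongs to, and `physical_bucket(osd)` for a lookup of an OSD's physical bucket; both are assumed helpers for illustration, not part of the patent text.

```python
import random

def reselect_osd(pg_osds, virtual_osd, choose, physical_bucket, max_retries=50):
    """Re-pick a replacement for a virtual OSD whose physical bucket collides
    with the PG's other replicas.

    Repeatedly re-draws the random parameter r until the candidate lands on a
    bucket different from the PG's remaining OSDs, or the retry cap is hit,
    in which case the last draw is used (as the text suggests).
    """
    # Physical buckets of the PG's other OSDs, which the candidate must avoid.
    others = {physical_bucket(o) for o in pg_osds if o != virtual_osd}
    candidate = None
    for _ in range(max_retries):
        candidate = choose(random.randrange(1 << 32))  # one CRUSH draw with a new r
        if physical_bucket(candidate) not in others:
            return candidate  # lands on a different physical bucket
    return candidate  # retry cap reached: fall back to the last calculated OSD

# Toy demo: replicas a1 (real) and a2 (virtual) both live on bucket A;
# the re-draw helper always proposes b1, which lives on bucket B.
pick = reselect_osd(["a1", "a2"], "a2", lambda r: "b1", lambda o: o[0].upper())
print(pick)  # -> b1
```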
To enable those skilled in the art to better understand the technical solutions provided by the embodiments of the present invention, the technical solutions are described below with reference to a specific application scenario.
Referring to FIG. 2A, which is a schematic diagram of bucket nodes designated as fault domains in an application scenario provided by an embodiment of the present invention: as shown in FIG. 2A, in this application scenario the number of bucket nodes designated as fault domains is 3 (assumed to be bucket A, bucket B and bucket C), with weights of 30, 30 and 18 respectively. Assume the preset usage rate threshold (V1) is 85%.
Based on this application scenario, the data processing of the embodiment of the present invention is implemented as follows:
1. A pool (assumed to be pool1) is newly created in the Ceph cluster, and the number of replicas specified by the pool's data redundancy policy is 3, i.e., the pool's replica count equals the number of bucket nodes designated as fault domains. The monitor therefore adds the pool to the equal-amount bucket group corresponding to these bucket nodes (assumed to be equal-amount bucket group 1).
If the equal-amount bucket group corresponding to these bucket nodes has not yet been created, the group may be created, the pool added to it, and a CRUSH map corresponding to the group created; this CRUSH map is provided with a type field corresponding to the above bucket nodes. After the monitor adds the pool to the equal-amount bucket group, it needs to maintain the list of pools (i.e., member pools) under the group and deliver the list to the Ceph cluster nodes and clients; the delivery is performed in the same way as the current cluster map is delivered, and is not described in detail.
2. The monitor calculates the theoretical storage usage rate V2 of equal-amount bucket group 1: V2 = 18/[(30+30+18)/3] ≈ 70%. Since V2 < V1, the monitor triggers storage unit migration among bucket A, bucket B and bucket C.
Since the weights of bucket A and bucket B are both greater than that of bucket C, and bucket A and bucket B have the same weight, storage units of a certain capacity may be migrated from bucket A and from bucket B to bucket C, as illustrated in FIG. 2B, where the dashed cylinders in bucket C represent storage units migrated to bucket C from bucket A or bucket B; the OSDs corresponding to these storage units are virtual OSDs.
In this embodiment, assuming 4T of storage capacity is migrated from each of bucket A and bucket B to bucket C, the weights of bucket A, bucket B and bucket C after migration become 26, 26 and 26, and the theoretical storage usage rate after migration is V3 = 26/[(26+26+26)/3] = 100%.
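The per-bucket migration amounts in this example follow from bringing every bucket to the mean weight, which can be sketched as below. This is illustrative Python only; the patent leaves the concrete migration planning to the monitor.

```python
def equalizing_moves(weights):
    """How much capacity each bucket sheds (>0) or receives (<0) so that all
    fault-domain buckets end at the mean weight."""
    mean = sum(weights.values()) / len(weights)
    return {bucket: w - mean for bucket, w in weights.items()}

# Weights 30, 30, 18 have mean 26: buckets A and B each migrate out 4T,
# bucket C receives 8T, matching the example in the text.
moves = equalizing_moves({"A": 30, "B": 30, "C": 18})
print(moves)  # -> {'A': 4.0, 'B': 4.0, 'C': -8.0}
```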
3. After the above storage unit migration is completed, the monitor may update the CRUSH map corresponding to equal-amount bucket group 1.
4. When the client receives a data read/write request for pool1, the monitor determines the equal-amount bucket group corresponding to pool1, i.e., equal-amount bucket group 1, according to the recorded member pool lists of the equal-amount bucket groups, determines the OSD hit by the request according to the CRUSH map corresponding to equal-amount bucket group 1, and performs the data read/write processing.
As can be seen from the above description, in the technical solutions provided by the embodiments of the present invention, for any pool of the Ceph cluster, when the pool's replica count equals the number of bucket nodes designated as fault domains, the pool is added to the first-type pool group corresponding to those bucket nodes; for any first-type pool group, when the group's first theoretical storage usage rate is less than the preset usage rate threshold, storage unit migration between the bucket nodes corresponding to the group is triggered, so that the group's second theoretical storage usage rate after migration is greater than the first theoretical storage usage rate, and the CRUSH map corresponding to the group is updated after the migration is completed. This improves the theoretical storage usage rate of the first-type pool group, and thus the storage usage rate of the Ceph cluster.
Referring to FIG. 3, which is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present invention: the apparatus may be applied to the monitor in the above method embodiments. As shown in FIG. 3, the data processing apparatus may include:
a pool group management unit 310, configured to, for any storage pool (pool) of the Ceph cluster, add the pool to the first-type pool group corresponding to the bucket nodes designated as fault domains when the pool's replica count equals the number of such bucket nodes;
a migration unit 320, configured to, for any first-type pool group, trigger storage unit migration between the bucket nodes corresponding to the group when the group's first theoretical storage usage rate is less than a preset usage rate threshold, so that the group's second theoretical storage usage rate after migration is greater than the first theoretical storage usage rate;
a maintenance unit 330, configured to update the Controlled Replication Under Scalable Hashing (CRUSH) map corresponding to the first-type pool group after the migration is completed.
In an optional embodiment, the migration unit 320 is specifically configured to migrate storage units between the bucket nodes corresponding to the first-type pool group on the principle that the second theoretical storage usage rate is greater than or equal to the preset usage rate threshold;
and, when no migration scheme makes the second theoretical storage usage rate greater than or equal to the preset usage rate threshold, to migrate storage units between the bucket nodes corresponding to the first-type pool group on the principle of minimizing the absolute value of the difference between the second theoretical storage usage rate and the preset usage rate threshold.
In an optional embodiment, the migration unit 320 is further configured to, when there are multiple migration schemes that make the second theoretical storage usage rate greater than or equal to the preset usage rate threshold, or multiple migration schemes that minimize the absolute value of the difference between the second theoretical storage usage rate and the preset usage rate threshold, determine the migration scheme actually used through one or more of the following principles:
a migration scheme that migrates out fewer storage units is preferred; a migration scheme in which more bucket nodes have storage units migrated out is preferred; a migration scheme in which the storage units migrated out of the same bucket node are more concentrated is preferred.
In an optional embodiment, the pool group management unit 310 is further configured to, when an operation that adds or removes storage units on a bucket node corresponding to the first-type pool group is detected, determine the third theoretical storage usage rate that the group would have after the operation if no storage unit migration had been performed;
the pool group management unit 310 is further configured to, when the third theoretical storage usage rate is greater than the second theoretical storage usage rate, delete the first-type pool group and add each pool of the group to the second-type pool group, wherein each pool of the Ceph cluster belongs to the second-type pool group by default in its initial state;
the maintenance unit 330 is further configured to re-determine the mapping relationship between the placement groups (PGs) and object storage devices (OSDs) of each pool of the original first-type pool group that has joined the second-type pool group.
In an optional embodiment, the maintenance unit 330 is further configured to, for any PG in the first-type pool group, when the multiple OSDs corresponding to the PG physically belong to the same bucket node, select one virtual OSD from the OSDs corresponding to the PG, delete the mapping relationship between the PG and that virtual OSD, and recalculate an OSD within the bucket node to which that virtual OSD logically belongs, such that the recalculated OSD and the other OSDs corresponding to the PG physically belong to different bucket nodes.
In an optional embodiment, the maintenance unit 330 is further configured to maintain the correspondence between pools and first-type pool groups.
Referring also to FIG. 4, which is a schematic structural diagram of another data processing apparatus provided by an embodiment of the present invention: as shown in FIG. 4, on the basis of the data processing apparatus shown in FIG. 3, the data processing apparatus shown in FIG. 4 further includes:
a delivery unit 340, configured to deliver the correspondence between pools and first-type pool groups to the Ceph cluster nodes and clients, so that the clients perform data read/write processing according to the correspondence.
FIG. 5 is a schematic diagram of the hardware structure of a data processing apparatus provided by an example of the present disclosure. The data processing apparatus may include a processor 501 and a machine-readable storage medium 502 storing machine-executable instructions. The processor 501 and the machine-readable storage medium 502 may communicate via a system bus 503. By reading and executing the machine-executable instructions in the machine-readable storage medium 502 corresponding to the data processing logic, the processor 501 can execute the data processing method described above.
The machine-readable storage medium 502 mentioned herein may be any electronic, magnetic, optical or other physical storage device that can contain or store information such as executable instructions and data. For example, the machine-readable storage medium may be a RAM (Random Access Memory), a volatile memory, a non-volatile memory, a flash memory, a storage drive (such as a hard disk drive), a solid-state drive, any type of storage disc (such as an optical disc or DVD), a similar storage medium, or a combination thereof.
For the implementation of the functions and roles of the units of the above apparatus, refer to the implementation of the corresponding steps in the above method; details are not repeated here.
Since the apparatus embodiments substantially correspond to the method embodiments, the description of the method embodiments may be referred to for relevant parts. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present invention. Those of ordinary skill in the art can understand and implement this without creative effort.
As can be seen from the above embodiments, for any pool of the Ceph cluster, when the pool's replica count equals the number of bucket nodes designated as fault domains, the pool is added to the first-type pool group corresponding to those bucket nodes; for any first-type pool group, when the group's first theoretical storage usage rate is less than the preset usage rate threshold, storage unit migration between the bucket nodes corresponding to the group is triggered, so that the group's second theoretical storage usage rate after migration is greater than the first theoretical storage usage rate, and the CRUSH map corresponding to the group is updated after the migration is completed, which improves the theoretical storage usage rate of the first-type pool group and thus the storage usage rate of the Ceph cluster.
Other embodiments of the invention will be readily apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses or adaptations of the invention that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and examples are to be considered exemplary only, with the true scope and spirit of the invention being indicated by the following claims.
It should be understood that the invention is not limited to the precise constructions described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.
| Application Number | Priority Date | Filing Date | Title | Status |
|---|---|---|---|---|
| CN201711047852.6A | 2017-10-31 | 2017-10-31 | A data processing method and device | Active |

| Publication Number | Publication Date |
|---|---|
| CN107704212A | 2018-02-16 |
| CN107704212B | 2019-09-06 |
| Country | Link |
|---|---|
| CN (1) | CN107704212B (en) |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108846009B (en)* | 2018-04-28 | 2021-02-05 | 北京奇艺世纪科技有限公司 | Copy data storage method and device in ceph |
| CN108804568B (en)* | 2018-05-23 | 2021-07-09 | 北京奇艺世纪科技有限公司 | Method and device for storing copy data in Openstack in ceph |
| CN108829738B (en)* | 2018-05-23 | 2020-12-25 | 北京奇艺世纪科技有限公司 | Data storage method and device in ceph |
| CN111381770B (en)* | 2018-12-30 | 2021-07-06 | 浙江宇视科技有限公司 | A data storage switching method, device, device and storage medium |
| CN109960470B (en)* | 2019-03-28 | 2022-07-29 | 新华三技术有限公司 | Data processing method and device and leader node |
| CN112181309A (en)* | 2020-10-14 | 2021-01-05 | 上海德拓信息技术股份有限公司 | Online capacity expansion method for mass object storage |
| CN115460230B (en)* | 2022-09-06 | 2025-05-16 | 中国科学技术大学 | A data migration method and unified coordination system |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104750624A (en)* | 2013-12-27 | 2015-07-01 | 英特尔公司 | Data Coherency Model and Protocol at Cluster Level |
| CN107133228A (en)* | 2016-02-26 | 2017-09-05 | 华为技术有限公司 | A kind of method and device of fast resampling |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9519577B2 (en)* | 2013-09-03 | 2016-12-13 | Sandisk Technologies Llc | Method and system for migrating data between flash memory devices |
| US9311377B2 (en)* | 2013-11-13 | 2016-04-12 | Palo Alto Research Center Incorporated | Method and apparatus for performing server handoff in a name-based content distribution system |
| CN104583930B (en)* | 2014-08-15 | 2017-09-08 | Huawei Technologies Co., Ltd. | Data migration method, controller and data migration device |
| CN107211003B (en)* | 2015-12-31 | 2020-07-14 | Huawei Technologies Co., Ltd. | Distributed storage system and method for managing metadata |
| CN106599308B (en)* | 2016-12-29 | 2020-01-31 | Guo Xiaofeng | Distributed metadata management method and system |
| Publication number | Publication date |
|---|---|
| CN107704212A (en) | 2018-02-16 |
| Publication | Title |
|---|---|
| CN107704212B (en) | A data processing method and device |
| CN103150347B (en) | Dynamic replica management method based on file temperature |
| US11609884B2 (en) | Intelligent file system with transparent storage tiering | |
| US20190179808A1 (en) | Method and apparatus for data migration in database cluster, and storage medium | |
| CN103354923B (en) | Data reconstruction method, device and system |
| US11262916B2 (en) | Distributed storage system, data processing method, and storage node | |
| WO2020204880A1 (en) | Snapshot-enabled storage system implementing algorithm for efficient reclamation of snapshot storage space | |
| US20180205791A1 (en) | Object storage in cloud with reference counting using versions | |
| CN105657066A (en) | Load rebalancing method and device for a storage system |
| JP5592493B2 (en) | Storage network system and control method thereof | |
| TWI617924B (en) | Memory data versioning | |
| US11194501B2 (en) | Standby copies withstand cascading fails | |
| US11023159B2 (en) | Method for fast recovering of data on a failed storage device | |
| US11223681B2 (en) | Updating no sync technique for ensuring continuous storage service in event of degraded cluster state | |
| US20170220586A1 (en) | Assign placement policy to segment set | |
| CN115982101B (en) | Machine room data migration method and device based on multi-machine room copy placement strategy | |
| US20170357659A1 (en) | Systems and methods for managing snapshots of a file system volume | |
| CN105760391B (en) | Method for dynamic data redistribution, data node, name node and system | |
| US11216204B2 (en) | Degraded redundant metadata, DRuM, technique | |
| CN108132759A (en) | Method and apparatus for managing data in a file system |
| US12141105B2 (en) | Data placement selection among storage devices associated with nodes of a distributed file system cluster | |
| US20190188186A1 (en) | Consistent hashing configurations supporting multi-site replication | |
| WO2024187818A1 (en) | Data migration method, system and device and non-volatile readable storage medium | |
| CN106527982A (en) | Object distribution algorithm for object storage system consisting of heterogeneous storage devices | |
| WO2024012592A1 (en) | Adaptive data disk capacity management method and apparatus, electronic device, and storage medium |
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | CB02 | Change of applicant information | Address after: 310052, 11th Floor, 466 Changhe Road, Binjiang District, Hangzhou City, Zhejiang Province. Applicant after: Xinhua Sanxin Information Technology Co., Ltd. Address before: 310052, 11th Floor, 466 Changhe Road, Binjiang District, Hangzhou City, Zhejiang Province. Applicant before: Huashan Information Technology Co., Ltd. |
| | GR01 | Patent grant | |