CN115048061A

Info

Publication number: CN115048061A
Application number: CN202210870513.2A
Authority: CN
Inventors: 乔媛媛; 徐明威; 陈劲伊; 杨洁
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2022-07-22
Filing date: 2022-07-22
Publication date: 2022-09-13
Anticipated expiration: 2042-07-22
Also published as: CN115048061B

Abstract

The invention discloses a Raft-based cold data storage method, which relates to the technical field of storage and comprises the following steps: constructing a plurality of ECRaft groups in a distributed cluster, wherein the members of each ECRaft group elect a leader for that group; the leader periodically sends heartbeats within the group and predicts the states of the group members from the heartbeat information; after a client write request arrives, selecting a suitable ECRaft group to process it based on the load condition; the leader of the selected ECRaft group performs erasure coding on the data contained in the write request, and generates and distributes log entries according to the predicted states of the group members; and the leader cleans up redundant data through a state machine, synchronously updates the related data of the group members through heartbeats, and finally stores the data in the distributed cluster in the form of erasure-coded fragments. The invention saves storage space and network traffic overhead in the cold data storage process and also improves the efficiency of the process.

Description

Translated from Chinese

Raft-based cold data storage method

Technical Field

The invention relates to the technical field of storage, and more particularly to a Raft-based cold data storage method.

Background

The surge in data volume has advanced the field of information storage, but it has also brought new challenges. Data can be roughly divided into two categories by access frequency: frequently accessed data, called hot data, and infrequently accessed data, called cold data. Hot data accounts for less than 20% of the total, so most data is cold data, and storing hot and cold data without distinction inevitably wastes resources. In addition, large-scale data storage often cannot rely on a single machine, because the performance, storage capacity, reliability, and availability of a single machine cannot meet industry standards. Hundreds or thousands of machines are therefore usually combined into a storage cluster, and many companies and organizations have begun to develop and use distributed storage technology.

Erasure coding is a data redundancy and fault tolerance mechanism that combines low data redundancy with sufficiently high fault tolerance. In distributed storage systems, more and more scenarios introduce this technology in order to store cold data with high reliability and high storage space utilization. With erasure-coded storage, the complete data to be stored is divided into a number of fragments and encoded into some additional parity fragments, and each machine stores only one of the fragments. This reduces the storage space occupied and in turn lowers storage cost and network traffic. However, distributed systems still face a consensus problem that erasure coding alone cannot solve.

Consensus protocols such as Paxos and Raft can be used to solve this problem and provide highly available and highly reliable storage services, and they are therefore widely deployed in industrial applications. Such a protocol converts user data into log command entries and replicates them to different servers, so that user data ends up stored redundantly on all servers in the cluster and is protected against server failure or data corruption. However, Paxos and Raft tolerate failures through replica-based data redundancy. This mechanism spends a large amount of space on redundant data, and replicating the data to other servers also incurs considerable network traffic. For example, in a cluster with N = 2F + 1 servers, a consensus algorithm that tolerates failures through replica redundancy may incur network and storage overhead of roughly N times the original data volume.
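To make the overhead comparison concrete, the following sketch contrasts full replication with a (k, m) erasure-coded layout. The function names are illustrative and not part of the patent. The erasure-coded figure assumes that each of the k + m servers stores exactly one fragment whose size is 1/k of the original data, as described in the erasure coding paragraph above.

```python
from fractions import Fraction


def replication_overhead(n_servers: int) -> Fraction:
    """Storage multiple when every server keeps a full copy of the data."""
    return Fraction(n_servers)


def erasure_overhead(k: int, m: int) -> Fraction:
    """Storage multiple when k + m servers each keep one fragment of size 1/k."""
    return Fraction(k + m, k)


if __name__ == "__main__":
    # Same number of servers (6) under both schemes.
    print("replication:", replication_overhead(6))
    print("erasure coding (k=4, m=2):", erasure_overhead(4, 2))
```

Under these assumptions, six replicas cost six times the original volume, while a (4, 2) layout on six servers costs one and a half times.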

To store cold data in a distributed cluster in erasure-coded form while also solving the consensus problem, two approaches are widely used today: asynchronous and synchronous. In the asynchronous approach, data is first stored in the cluster as replicas through a consensus protocol such as Raft or Paxos, or through another method that can solve the consensus problem, and the erasure coding process is performed asynchronously later. Although the replica data is cleared after storage completes in order to improve space utilization, these intermediate replicas inevitably cause additional storage device IO and network traffic. In the synchronous approach, data is encoded immediately when it is written, so no intermediate replica data appears during storage. However, the synchronous approach requires that all data fragments and parity fragments be successfully written to the storage cluster before success is returned to the client. Any loss of encoded fragments caused by a server or network failure therefore quickly disrupts the write process, which usually requires additional mechanisms in the system design.

Therefore, how to save storage space and network traffic overhead in the cold data storage process and improve the efficiency of the process is a technical problem that those skilled in the art urgently need to solve.

Summary of the Invention

In view of this, the present invention provides a Raft-based cold data storage method that improves storage efficiency, space utilization, and network bandwidth utilization in the cold data storage process.

To achieve the above object, the present invention provides the following technical solution:

A Raft-based cold data storage method, comprising the following steps:

constructing multiple ECRaft groups in a distributed cluster, wherein the members of each ECRaft group elect the leader of that group;

the leader periodically sends heartbeats within the group and predicts the states of the group members according to the heartbeat information;

after a client write request arrives, selecting a suitable ECRaft group to process the write request based on the load condition;

the leader of the selected ECRaft group erasure-codes the data contained in the write request, and generates and distributes log entries according to the predicted states of the group members;

the leader cleans up redundant data through a state machine and synchronously updates the related data of the group members through heartbeats, and the data is finally stored in the distributed cluster in the form of erasure-coded fragments.

The technical effect of the above solution is as follows: ECRaft is proposed on the basis of the Raft consensus protocol. It changes the election and commit conditions of Raft, uses a more optimized adaptive data redundancy scheme, adds a state machine data cleanup mechanism, and uses the Multi-ECRaft mechanism to solve the single-machine performance bottleneck.

Optionally, constructing multiple ECRaft groups in the distributed cluster specifically includes the following steps:

L servers form a Multi-ECRaft cluster, and each server contains several working storage devices and spare storage devices;

on each server, n working storage devices of the same model, or whose capacity difference is within a preset range, are evenly selected and combined into one ECRaft group, where the total number of selected working storage devices is N = 2F + 1 = k + m, k is the number of storage devices that store data fragments, m is the number of storage devices that store parity fragments, and k > m;

the above selection and combination process is repeated until there are enough ECRaft groups or the working storage devices on all servers have been used up, finally yielding Q ECRaft groups.

Optionally, combining the devices into one ECRaft group specifically includes the following steps:

calculating the distribution of group members according to the erasure code configuration and the number of servers in the cluster;

counting the models and quantities of storage devices in the cluster and starting the grouping with the model that has the most devices; if there are not enough storage devices of the same model, supplementing them with storage devices whose capacity difference is within the preset range;

counting the number of storage devices of the given model on each server in the cluster;

starting from the server that has the most storage devices of that model, taking the required number of working storage devices from the corresponding servers in turn, while each server keeps a certain proportion of spare storage devices unallocated; if every server has enough storage devices, the allocation succeeds, the ECRaft group is formed, and the configuration information of the ECRaft group is added to the metadata management database;

if the servers do not have enough storage devices and the devices of the selected model already make up more than half of the ECRaft group, working storage devices whose capacity difference is within the preset range are selected to continue the combination; if the devices of the selected model do not make up more than half of the ECRaft group, the selection for this ECRaft group is aborted, a different storage device model is selected, the number of devices of the newly selected model on each server in the cluster is counted, and the combination of the ECRaft group continues.

Optionally, the election by which the members of each ECRaft group obtain the leader of that group specifically includes the following steps:

each ECRaft group initiates leader election independently; every member of a group has a randomized timeout, and a member that does not receive the leader's heartbeat within its timeout becomes a candidate and broadcasts a campaign message to the other members of the group;

when another member receives the campaign message and confirms that its own log and term are no newer than the candidate's, it casts a vote in favor;

when k members of the group have voted in favor, the candidate becomes the leader.

Optionally, predicting the states of the group members is specifically:

the leader periodically sends heartbeat messages to the other members of the group; a member that receives a heartbeat completes its log update and replies to the leader; and the leader predicts the states of the group members based on which members responded successfully to the most recent heartbeat.

Optionally, selecting a suitable ECRaft group to process the write request specifically includes the following steps:

determining whether the file written by the client already exists, and if so, returning the ECRaft group in which the file is located; if the execution queue of that ECRaft group is full, the request joins the waiting queue of the ECRaft group; otherwise, it joins the execution queue of the ECRaft group;

if the file written by the client does not exist, the file is assigned to the ECRaft group with the smallest task volume, according to the task volume of each ECRaft group.

Optionally, generating and distributing log entries specifically includes the following steps:

the leader of the selected ECRaft group erasure-codes the data block in the write request, dividing the data block into k data fragments of equal size and encoding them to generate m parity fragments;

the leader decides the redundancy strategy for the data according to the predicted states of the group members, encapsulates the generated encoded fragments into log entries, and distributes them to the members of the group, specifically:

when all k + m members are predicted to be healthy, the leader distributes every encoded fragment to its corresponding member and ensures that all fragments have been persisted;

when p members are predicted to be unable to receive their fragments and p <= m, the leader persists the encoded fragments that the faulty members should have stored to every healthy member, and each healthy member also stores its own encoded fragment;

when p members are predicted to be unable to receive their fragments and p > m, a replication strategy is used: the leader encapsulates the complete data as a log entry and replicates it to the other members, and responds success to the client once more than half of the members of the ECRaft group have replicated it;

while the leader is distributing logs, if a wrong prediction of the ECRaft group state causes distribution to fail, the leader predicts again and redistributes according to the latest ECRaft group state; if the number of retransmissions exceeds the system configuration parameter q, the leader continues retrying with the replication strategy.

Optionally, the method further includes:

if a group member lacks a log that the leader has already committed, the leader copies the encoded fragment corresponding to that log to the group member;

if the client write request has been successfully persisted to the group members by the leader, the leader commits the log and responds success to the client, which completes the cold data storage process; otherwise, the leader keeps retrying until the client's wait times out.

Optionally, the leader cleaning up redundant data through the state machine and synchronously updating the related data of the group members through heartbeats is specifically:

while the leader distributes data, failures in the ECRaft group cause members to persist encoded fragments that do not belong to them; once these encoded fragments have been re-saved by the members they belong to, the leader deletes them from the state machine; if all members of the ECRaft group are healthy, the state machine of every healthy member finally retains only the fragment data that corresponds to that member.

Optionally, the method further includes:

if the working storage devices of an ECRaft group are full, state machine deletion has completed, and no member of the ECRaft group stores an encoded fragment that does not belong to it, the ECRaft group satisfies the storage device power-off condition, and the cluster then powers off the working storage devices of all members of the ECRaft group.
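The cleanup rule and the power-off condition described above can be sketched as follows. This is a simplified model under stated assumptions, not the patented implementation: the `Member` class, its field names, and the restriction to a single log entry (one fragment per owner) are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    member_id: int
    # Maps the id of the member a fragment belongs to -> fragment bytes held here.
    fragments: dict = field(default_factory=dict)
    device_full: bool = False  # hypothetical flag for "working storage device is full"


def clean_redundant(members: list) -> int:
    """Delete foreign fragments whose rightful owner now holds its own copy."""
    owners_restored = {m.member_id for m in members if m.member_id in m.fragments}
    removed = 0
    for m in members:
        for owner in list(m.fragments):
            if owner != m.member_id and owner in owners_restored:
                del m.fragments[owner]
                removed += 1
    return removed


def can_power_off(members: list) -> bool:
    """True when every device is full and no member holds a foreign fragment."""
    return all(
        m.device_full and set(m.fragments) <= {m.member_id} for m in members
    )
```

In this model, a foreign fragment is kept for as long as its owner has not re-saved it, so running the cleanup early does not reduce redundancy.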

As can be seen from the above technical solution, compared with the prior art, the present invention provides a Raft-based cold data storage method with the following beneficial effects:

(1) The present invention encodes and distributes the client's request data directly, avoiding the replication and distribution of complete data found in existing methods, which lowers storage IO overhead and network cost;

(2) The present invention achieves the same fault tolerance as the asynchronous cold data storage approach, and the state machine finally holds only erasure-coded data; because the scheme introduces no step that stores intermediate complete data, the storage process is simpler and more efficient;

(3) The present invention uses the Multi-ECRaft mechanism, which makes reasonable use of the performance of all servers in the cluster and prevents the server hosting a leader from becoming a bottleneck under heavy load and degrading the performance of the whole cluster;

(4) In the present invention, ECRaft groups are managed independently and files are stored as centrally as possible, which makes it easier for the cluster to manage the power-on and power-off, scanning, and reconstruction logic for each group of storage devices.

Description of Drawings

To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of the Raft-based cold data storage method;

FIG. 2 is an architecture diagram of Multi-ECRaft;

FIG. 3 is a diagram of the encoding process;

FIG. 4 is a diagram of the adaptive encoded fragment replication scheme;

FIG. 5 is a diagram of the state machine cleanup process.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

An embodiment of the present invention discloses a Raft-based cold data storage method, comprising the following steps:

constructing multiple ECRaft groups in a distributed cluster, wherein the members of each ECRaft group elect the leader of that group;

the leader periodically sends heartbeats within the group and predicts the states of the group members according to the heartbeat information;

after a client write request arrives, selecting a suitable ECRaft group to process the write request based on the load condition;

the leader of the selected ECRaft group erasure-codes the data contained in the write request, and generates and distributes log entries according to the predicted states of the group members;

the leader cleans up redundant data through the state machine and synchronously updates the related data of the group members through heartbeats, and the data is finally stored in the distributed cluster in the form of erasure-coded fragments.

Next, with reference to FIG. 1, the implementation steps of the Raft-based cold data storage method are described in more detail.

Step 1: L servers form a Multi-ECRaft cluster, and each server contains several working storage devices and spare storage devices. Specifically, the storage devices may be mechanical hard disks. Referring to FIG. 2, in this embodiment the cluster is built from 4 servers, each containing several working hard disks and spare hard disks.

Step 2: On each server, evenly select n working hard disks of the same model or of similar capacity and performance, such that the total number of selected working hard disks is N = 2F + 1 = k + m, where k hard disks store data fragments, m hard disks store parity fragments, and k > m. A (k, m)-RS erasure code configuration is used, where RS denotes a Reed-Solomon code. These working hard disks are combined into one ECRaft group. A member of an ECRaft group may be one hard disk, several hard disks, or part of one hard disk. The process is carried out as follows:

2.1) Calculate the distribution of group members according to the erasure code configuration and the number of servers in the cluster;

2.2) Count the models and quantities of hard disks in the cluster and start the grouping with the model that has the most disks; if there are not enough hard disks of the same model, supplement them with hard disks whose capacity difference is within the preset range;

2.3) Count the number of hard disks of the given model on each server in the cluster, giving, for example, M = {M_1, M_2, ..., M_n}, where M_i is the number of hard disks of that model on the i-th server;

2.4) Sort M in descending order and take the required number of hard disks from the corresponding servers in turn, leaving a certain proportion of the hard disks on each server unallocated; if every server has enough hard disks, the allocation succeeds and step 2.6) is executed; otherwise, proceed to the next step;

2.5) When working hard disks are selected, disks of the selected model must make up the majority of the ECRaft group. If they do not, abort the selection for this ECRaft group, select a different hard disk model, and return to step 2.3). If they do make up the majority but there are still not enough disks, select hard disks of a model whose capacity difference is within the preset range and continue the combination;

2.6) The ECRaft group is formed successfully, and the configuration information of the ECRaft group is added to the metadata management database.

In this embodiment, n = 2, N = 6, k = 4, and m = 2; 2 spare hard disks are reserved on each server, and each member of an ECRaft group is one hard disk. There are 4 storage servers, with 2 members placed on each server. Server 1 has 20 hard disks of model A, so the hard disks on server 1 can be allocated to at most 9 ECRaft groups.
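The figure of 9 groups for server 1 follows from simple arithmetic: 20 disks minus 2 spares leaves 18 working disks, and at 2 disks per group that is 9 groups. The sketch below reproduces this calculation and the descending ordering used in step 2.4). The function names and the server labels in the test data are illustrative assumptions, not taken from the patent.

```python
def groups_supported(disks: int, spares: int, per_group: int) -> int:
    """Number of ECRaft groups one server can contribute disks to."""
    if per_group <= 0 or spares < 0 or spares > disks:
        raise ValueError("invalid disk configuration")
    return (disks - spares) // per_group


def allocation_order(counts: dict) -> list:
    """Servers ordered by disk count of the chosen model, largest first."""
    return sorted(counts, key=counts.get, reverse=True)


if __name__ == "__main__":
    # Embodiment values for server 1: 20 model-A disks, 2 spares, n = 2 per group.
    print(groups_supported(20, 2, 2))
    # Hypothetical per-server counts for a single disk model.
    print(allocation_order({"server1": 20, "server2": 24, "server3": 18}))
```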

Step 3: Repeat the process of step 2 until the working hard disks on all servers have been used up, finally yielding Q ECRaft groups. The members of each group are denoted S = (S_1, S_2, ..., S_i, ..., S_N).

In this embodiment, referring to FIG. 2, Q = 3. The hard disks of these 3 ECRaft groups are evenly distributed across the 4 servers, and the members of a group are denoted S = (S_1, S_2, S_3, S_4, S_5, S_6).

Step 4: Each ECRaft group initiates leader election independently. Every member of an ECRaft group has a randomized timeout. A member that has not received the leader's heartbeat for a period of time times out and becomes a candidate, and the candidate then broadcasts a campaign message to the other members of the group. When another member receives the message and finds that its own log and term are no newer than the candidate's, it votes in favor. Once the candidate has received k approval responses (including its own vote), it becomes the leader of the group.

In this embodiment, after a follower becomes a candidate it sends the campaign message to the other members. Because it also votes for itself, it becomes the leader once it has collected approval votes from 3 other members.
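The voting rule and the k-vote threshold can be sketched as below. This is an illustration under assumptions: the text only says a voter approves when its log and term are no newer than the candidate's, so the comparison by term and last log index is a simplification chosen here, and the function names are hypothetical.

```python
def grants_vote(voter_term: int, voter_last_index: int,
                cand_term: int, cand_last_index: int) -> bool:
    """Approve when the voter's term and log are no newer than the candidate's.

    Assumption: "log no newer" is modeled as comparing the last log index only.
    """
    return voter_term <= cand_term and voter_last_index <= cand_last_index


def wins_election(approvals_including_self: int, k: int) -> bool:
    """A candidate becomes leader after k approvals, counting its own vote."""
    return approvals_including_self >= k


if __name__ == "__main__":
    k = 4  # embodiment: k = 4, m = 2
    own_vote = 1
    others = 3  # approvals collected from other members
    print(wins_election(own_vote + others, k))
```

With k = 4, the candidate's own vote plus 3 approvals from other members reaches the threshold, which matches the embodiment.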

Step 5: The leader periodically sends heartbeat messages to the other members of the group, and the members respond. Once the leader confirms that it has received a member's heartbeat response, it considers that member healthy. The leader maintains a dynamically changing list of healthy members, recording the members it currently considers healthy. This list is the predicted group member state U.

In this embodiment, if the leader receives heartbeat responses from all members of the group, its predicted group member state is U = (S_1, S_2, S_3, S_4, S_5, S_6).
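A minimal sketch of the healthy-member list follows, assuming the prediction is simply the set of members that answered the most recent heartbeat round. The `HealthTracker` class and its method names are hypothetical, and the leader is assumed to count itself among the responders.

```python
class HealthTracker:
    """Tracks the predicted member state U from the latest heartbeat round."""

    def __init__(self, members):
        self._members = list(members)
        self._healthy = set()

    def record_round(self, responders):
        """Replace the healthy set with the members that answered this round."""
        self._healthy = set(responders) & set(self._members)

    def predicted_state(self):
        """Return U as a list in group order."""
        return [m for m in self._members if m in self._healthy]


if __name__ == "__main__":
    group = ["S1", "S2", "S3", "S4", "S5", "S6"]
    tracker = HealthTracker(group)
    tracker.record_round(group)                      # every member responded
    print(tracker.predicted_state())
    tracker.record_round(["S1", "S2", "S4", "S5", "S6"])  # S3 missed the round
    print(tracker.predicted_state())
```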

Step 6: When a client write request W arrives, the cluster selects an ECRaft group to process the request according to the load, specifically:

6.1) Determine whether the file written by the client already exists, and if so, return the ECRaft group in which the file is located;

6.2) If the execution queue of that ECRaft group is full, the request joins the waiting queue of the ECRaft group; otherwise, it joins the execution queue;

6.3) If the file written by the client does not exist, assign it to the ECRaft group with the smallest task volume, according to the task volume of each ECRaft group.

In this embodiment, the length of the execution queue of an ECRaft group is 100.
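Steps 6.1) to 6.3) can be sketched as a small routing function. The sketch makes several assumptions that the text does not state: task volume is measured as the combined length of the execution and waiting queues, a newly assigned file is enqueued on its group immediately, and the file-to-group mapping is a plain dictionary. The names `Group`, `route_write`, and `file_index` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Group:
    name: str
    capacity: int = 100  # execution queue length used in the embodiment
    executing: deque = field(default_factory=deque)
    waiting: deque = field(default_factory=deque)

    def load(self) -> int:
        # Assumption: task volume = queued requests in both queues.
        return len(self.executing) + len(self.waiting)

    def enqueue(self, request: str) -> str:
        """Join the execution queue, or the waiting queue when it is full."""
        if len(self.executing) >= self.capacity:
            self.waiting.append(request)
            return "waiting"
        self.executing.append(request)
        return "executing"


def route_write(file_name: str, groups: list, file_index: dict):
    """Return (group, queue) for a write request on file_name."""
    group = file_index.get(file_name)
    if group is None:
        # New file: pick the group with the smallest task volume.
        group = min(groups, key=Group.load)
        file_index[file_name] = group
    return group, group.enqueue(file_name)
```

Keeping later writes to an existing file on the same group is what lets files stay concentrated within one group of storage devices.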

Step 7: The leader of the selected ECRaft group first erasure-codes the data contained in the request; the process is shown in FIG. 3. The data is divided into k data fragments of equal size, D = (D_1, D_2, ..., D_k). RS encoding of D generates m parity fragments, Q = (Q_1, Q_2, ..., Q_m). Together these fragments form F = (F_1, F_2, ..., F_{k+m}) = (D_1, D_2, ..., D_k, Q_1, Q_2, ..., Q_m). The term and the index are metadata required by the protocol, and each log distributed by the leader contains a data or parity fragment, the term, and the index.

In this embodiment, the cluster maintains a load list for the ECRaft groups and each time selects the least-loaded ECRaft group to serve the client request. The leader divides the data into 4 data fragments of equal size, D = (D_1, D_2, D_3, D_4). RS encoding of D generates 2 parity fragments, Q = (Q_1, Q_2). Together these fragments form F = (F_1, F_2, ..., F_6) = (D_1, D_2, D_3, D_4, Q_1, Q_2). Each log consists of a fragment, the term, and the index.
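The splitting and packaging part of this step can be sketched as follows. The sketch covers only the division into k equal-size data fragments and the wrapping of fragments into log entries. Generating the m parity fragments requires a Reed-Solomon encoder and is not shown. Zero-padding the last fragment is an assumption made here (a real system would also need to record the original length), and `LogEntry`, `split_data`, and `build_entries` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    term: int        # protocol metadata
    index: int       # protocol metadata
    fragment: bytes  # one data or parity fragment


def split_data(data: bytes, k: int) -> list:
    """Divide data into k equal-size fragments, zero-padding the tail."""
    if k <= 0:
        raise ValueError("k must be positive")
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\x00")
    return [padded[i * size:(i + 1) * size] for i in range(k)]


def build_entries(fragments: list, term: int, index: int) -> list:
    """Wrap each fragment in a log entry carrying the same term and index."""
    return [LogEntry(term, index, frag) for frag in fragments]


if __name__ == "__main__":
    data_fragments = split_data(b"abcdefghij", 4)  # k = 4 as in the embodiment
    for entry in build_entries(data_fragments, term=1, index=7):
        print(entry)
```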

步骤8：领导者根据预测的组内成员状态生成并分发日志条目，存在下面四种情况：Step 8: The leader generates and distributes log entries according to the predicted state of members in the group. There are the following four situations:

8.1) All members S = (S₁, S₂, ..., S_i, ..., S_N) are in the leader's healthy list. The leader encapsulates the fragments F one-to-one into (k+m) different log entries L = (L₁, L₂, ..., L_k, ..., L_{k+m}), where (L₁, L₂, ..., L_k) contain (D₁, D₂, ..., D_k) respectively and (L_{k+1}, ..., L_{k+m}) contain (Q₁, Q₂, ..., Q_m) respectively. L and S are also in one-to-one correspondence: apart from its own entry, which it keeps itself, the leader sends L₁ to S₁, L₂ to S₂, and so on, up to L_{k+m} to S_N. Once all members S have persisted their corresponding log entries, the leader can commit L.

8.2) The members S_e = (S_e1, S_e2, ..., S_ep) are predicted by the leader to have failed, where p ≤ m. The log entry L_i that the leader distributes to each remaining healthy member must then contain not only that member's own fragment F_i but also the fragments F_e = (F_e1, F_e2, ..., F_ep) that S_e should have stored, i.e., L_i = (F_i, F_e1, F_e2, ..., F_ep), where i denotes a healthy member. Once all remaining healthy members have persisted these fragments, the leader can commit the log entry.

8.3) The members S_e = (S_e1, S_e2, ..., S_ep) are predicted by the leader to have failed, where p > m. In this case the replica replication strategy is used: the leader encapsulates the client's complete data as a log entry and replicates it to the other healthy members. Once more than half of the members in the group have persisted the log entry, the leader can commit it. The detailed replica replication process is the same as in Raft and is not repeated here.

8.4) A member's state changes while the leader is distributing log entries, causing distribution to fail. Before distribution, member S_i was predicted to be healthy, but during distribution S_i fails and cannot receive its log entry. The leader must then update the member states, redistribute the log entry according to the latest member states, and increment the retransmission count by 1.

8.4.1) If the retransmission count is less than or equal to q (a system configuration parameter), the leader re-executes the distribution strategy of Step 8;

8.4.2) If the retransmission count is greater than q (a system configuration parameter), the leader executes the replica replication strategy.

In this embodiment, Figure 4 illustrates the processes of 8.1) and 8.2). Member S₄ is elected leader. While all members are healthy, entry b arrives at time T₀. Before committing entry b, the leader must not only persist entry b and its encoded fragments b₁, b₂, b₃, b₄, b₅, b₆ itself, but also confirm that b₁, b₂, b₃, b₅, b₆ have been persisted by S₁, S₂, S₃, S₅, S₆ respectively. At time T₁, entry c arrives and S₆ has failed. The leader stores each member's own fragment together with c₆ on S₁, S₂, S₃, and S₅. At time T₂, entry d arrives and both S₅ and S₆ have failed, so the leader must persist each member's own fragment together with d₅ and d₆ on S₁, S₂, and S₃.
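The four cases of Step 8 can be summarized in a short sketch. It is a simplified model under stated assumptions, not the patented implementation: members and fragments are numbered 1..k+m with member i owning fragment i, `send(member, payload)` is a hypothetical transport callback that returns whether the member persisted the payload (the leader's local persist is modeled the same way), and the names `plan_distribution` and `distribute` are illustrative.

```python
def plan_distribution(k, m, failed):
    """Map each healthy member to the fragment ids it must persist.

    The payload is the string "replica" when more than m members are
    predicted failed and full-copy replication is required (case 8.3).
    """
    n = k + m
    failed = set(failed)
    healthy = [s for s in range(1, n + 1) if s not in failed]
    if len(failed) > m:
        return {s: "replica" for s in healthy}
    orphaned = sorted(failed)
    # 8.1 when nothing failed; 8.2 adds the failed members' fragments.
    return {s: [s] + orphaned for s in healthy}


def distribute(k, m, predicted_failed, send, q):
    """Distribute one entry, retrying coded distribution at most q times."""
    n = k + m
    failed = set(predicted_failed)
    retries = 0
    while retries <= q and len(failed) <= m:
        plan = plan_distribution(k, m, failed)
        lost = {s for s, payload in plan.items() if not send(s, payload)}
        if not lost:
            return "coded", plan  # every healthy member persisted its entry
        failed |= lost   # 8.4: update member states
        retries += 1     # and count the retransmission
    # 8.3 / 8.4.2: full-copy replication, committed on a majority of the group.
    acks = sum(1 for s in range(1, n + 1)
               if s not in failed and send(s, "replica"))
    if acks > n // 2:
        return "replica", acks
    raise RuntimeError("no majority persisted the entry")
```

For the situation of Figure 4 at time T₁, `plan_distribution(4, 2, {6})` maps each of members 1 to 5 to its own fragment plus fragment 6; at time T₂, `plan_distribution(4, 2, {5, 6})` maps members 1 to 4 to their own fragment plus fragments 5 and 6.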

Step 9: If some member of the group is missing log entries that the leader has already committed, there are two cases:

9.1) The leader holds the data in fragment form, F = (F₁, F₂, ..., F_{k+m}). It simply sends the log entry containing the corresponding fragment to that member; for example, if S_i lacks F_i, the leader sends L_i to S_i;

9.2) The leader holds only replica data. It first performs erasure coding to generate the fragments and then sends each fragment to the corresponding member.
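The two repair cases of Step 9 reduce to a small lookup. In this sketch the leader's storage layout (`leader_store`, a dictionary keyed by log index holding either a `"fragments"` list or a `"replica"` blob) and the `encode` callback are assumptions made for illustration only.

```python
def fragment_for_repair(leader_store, index, member, encode):
    """Return the fragment the leader resends to `member` (numbered from 1)."""
    record = leader_store[index]
    if "fragments" not in record:
        # 9.2: only the full replica exists, so erasure-code it first.
        record["fragments"] = encode(record["replica"])
    # 9.1: send the fragment that corresponds to the member.
    return record["fragments"][member - 1]
```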

Step 10: If the leader successfully persists the client's write request W to the group members in the manner of Step 8, the leader can commit the log entry and report success to the client; otherwise, the leader keeps retrying until the client's wait times out.

Step 11: The replication processes of 8.2), 8.3), and 8.4) persist (F_e1, F_e2, ..., F_ep) or full replica data on S = (S₁, S₂, ..., S_i, ..., S_N). The state machine cleanup process removes this redundant data. It does not delete the log entries themselves; it deletes the data of those entries in the state machine.

11.1) The leader learns each member's log storage status through heartbeats. If it finds that S₁ has persisted and applied L₁, S₂ has persisted and applied L₂, and so on, up to S_N having persisted and applied L_{k+m}, it marks L as a cleanable entry and updates the maximum cleanable log index to the index of L;

11.2) The leader first cleans its own log according to the maximum cleanable log index, then informs the other members through heartbeats of the maximum index it has cleaned; the other members clean their own state machines according to this index. In addition, the leader may trigger state machine cleanup periodically or manually at a suitable time, for example when the system is idle at night or when storage capacity is running low;

11.3) After the state machine cleanup of L is complete, the state machine of S₁ retains only the data of L₁, the state machine of S₂ retains only the data of L₂, and so on, with the state machine of S_N retaining only the data of L_{k+m}.

In this embodiment, referring to Figure 5, S₄ is the leader and the erasure code configuration is (4,2)-RS. The ECRaft group members are S = (S₁, S₂, S₃, S₄, S₅, S₆). The fragments of entry a have each been persisted by the six members, but those of entry b have not, because b₆ has not been persisted by S₆. Entry a is therefore cleanable, and the maximum cleanable log index is updated to the index of a. When cleaning a, S₄ first deletes the state machine data applied from the other fragments of entry a, keeping only the data applied from F₄, and then tells each S_i (1 ≤ i ≤ N, i ≠ 4) to delete the state machine data applied from F_j (j ≠ i).
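The cleanup of Step 11 can be sketched as two small functions. The data layouts are assumptions made for illustration: `applied[index]` is the set of members known, from heartbeats, to have persisted and applied their own fragment of that entry, and `state[index]` maps fragment ids to the data a member's state machine holds for that entry. The sketch also assumes that entries become cleanable in log order, which the text does not state explicitly.

```python
def max_cleanable_index(applied, n):
    """Largest index such that it and all earlier entries are fully applied.

    Returns 0 when no entry is cleanable yet.
    """
    everyone = set(range(1, n + 1))
    cleanable = 0
    for index in sorted(applied):
        if applied[index] != everyone:
            break
        cleanable = index  # 11.1: every S_i persisted and applied L_i
    return cleanable


def clean_state_machine(state, member, cleanable):
    """11.2/11.3: for cleanable entries keep only the member's own fragment.

    The log entries themselves are untouched; only state machine data goes.
    """
    for index in list(state):
        if index <= cleanable:
            state[index] = {fid: data for fid, data in state[index].items()
                            if fid == member}
    return state
```

In the Figure 5 situation, with entry a at index 1 applied by all six members and entry b at index 2 still missing on S₆, `max_cleanable_index` returns 1, and cleaning S₄'s state machine leaves only fragment 4 of entry a while entry b's data stays in place.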

Step 12: If the hard disks hosting an ECRaft group are full, Step 11 has been completed, and no member of the group stores coded fragments that do not belong to it, the ECRaft group satisfies the disk power-off condition, and the cluster subsequently powers off the hard disks hosting all members of that ECRaft group.

The invention encodes and distributes the client's request data directly, avoiding the replication and distribution of complete data required by existing methods, which lowers storage I/O overhead and network cost. It achieves the same fault tolerance as asynchronous cold data storage approaches, and the state machines ultimately hold only erasure-coded data. Because the scheme introduces no intermediate stage that stores the complete data, the storage process is simpler and more efficient. The Multi-ECRaft mechanism makes reasonable use of the performance of all servers in the cluster and prevents the server hosting a leader from becoming a bottleneck under heavy task load and thereby degrading the performance of the whole cluster. ECRaft groups are managed independently and files are stored as centrally as possible, which makes it easier for the cluster to manage the power-on and power-off, scanning, and reconstruction logic for each group's hard disks.
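As a rough worked example of the storage saving, assume (this figure is not stated in the text) that plain Raft replication would keep one full copy of a write W of size |W| on each of the N = k + m members. Ignoring padding and metadata, the final footprints for the (4,2)-RS configuration of the embodiment then compare as follows:

```latex
\text{ECRaft after cleanup: } (k+m)\cdot\frac{|W|}{k} = 6\cdot\frac{|W|}{4} = 1.5\,|W|,
\qquad
\text{full replication: } N\cdot|W| = 6\,|W|.
```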

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and for parts that are the same or similar the embodiments may be referred to one another.

The above description of the disclosed embodiments enables any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.