Background
The high reliability of data in a distributed database is usually achieved through redundancy of data copies: by storing service data in multiple data copies simultaneously, the distributed database cluster remains available and the data remains safe even when the primary data copy fails.
The mainstream strong node synchronization protocols today are majority-based distributed consensus protocols, such as the Paxos protocol and the Raft protocol.
The Raft protocol is a distributed consensus protocol in relatively wide use today. In the Raft protocol, each node has a state, and the state machine is deterministic, with different states corresponding to different series of operations. Synchronous replication is added on top of the state machine, and during synchronous replication each node records the position and order of replication in a cache or on disk. In short, multiple nodes start from the same initial state, execute the same sequence of commands, and arrive at the same final state.
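By way of illustration only, the state machine replication idea above can be sketched as follows; the key-value state machine and command format are hypothetical and are not part of the claimed implementation:

```python
# Minimal sketch of deterministic state machine replication (illustrative only).
class KVStateMachine:
    def __init__(self):
        self.state = {}

    def apply(self, command):
        # Commands are deterministic: same input state + same command => same output state.
        op, key, value = command
        if op == "set":
            self.state[key] = value
        elif op == "delete":
            self.state.pop(key, None)

commands = [("set", "x", 1, ), ("set", "y", 2), ("delete", "x", None)]
replica_a, replica_b = KVStateMachine(), KVStateMachine()
for cmd in commands:              # both replicas apply the same command sequence
    replica_a.apply(cmd)
    replica_b.apply(cmd)
assert replica_a.state == replica_b.state  # identical final state on every replica
```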
A node in the Raft protocol is in one of three states: candidate, leader, or follower. A candidate is a node running for election as the master node, and may become either a leader or a follower. The leader is the elected master node; it handles communication and interaction with clients and maintains data synchronization and the heartbeat mechanism with the followers. A follower is a standby node; it mainly forwards client requests to the master node and keeps its data synchronized with the master node.
The Raft protocol uses a majority ("more than half") election mechanism to elect the leader and followers. Taking three nodes (N = 3) as an example: the three nodes first enter the candidate state and each creates a Term; the three nodes then vote within that Term; under the majority election mechanism, a node becomes the leader only when it receives more than N/2 votes; the node that becomes the leader sends heartbeats to the other nodes, and the other nodes transition from candidate to follower. Majority election mechanisms are also employed elsewhere, for example in zookeeper clusters, and are not described in detail here.
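By way of illustration only, the "more than N/2" check described above can be sketched as a simple tally of votes per candidate (node names are hypothetical):

```python
# Minimal sketch of the "> N/2" majority election check (illustrative only).
def elect_leader(votes, n_nodes):
    """votes maps voter -> candidate; a candidate wins only with more than N/2 votes."""
    tally = {}
    for candidate in votes.values():
        tally[candidate] = tally.get(candidate, 0) + 1
    for candidate, count in tally.items():
        if count > n_nodes // 2:   # strictly more than half of all nodes
            return candidate       # becomes leader and sends heartbeats to the others
    return None                    # no majority: election fails and a new Term is started

# Three-node example (N = 3): B receives 2 votes, 2 > 3 // 2, so B becomes the leader.
assert elect_leader({"A": "B", "B": "B", "C": "C"}, n_nodes=3) == "B"
```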
Assume node B is the leader and nodes A and C are followers. A client sends data to node B, and upon receiving the instruction node B begins data storage and data synchronization. This is not a direct write; a two-phase commit is used instead. The two-phase commit proceeds as follows. In the first phase, leader node B receives the client's data request and sends requests to all follower nodes A and C; follower nodes A and C receive the request from leader node B, confirm that they can perform the data write, and respond to leader node B with an Ack. After receiving the Ack responses, leader node B determines whether more than half of the nodes have returned an executable Ack; if so, it stores the data, records the log, and returns a success response to the client. In the second phase, leader node B again sends a request confirming execution to all follower nodes A and C, and follower nodes A and C store the data and record their logs. Note that follower nodes A and C also accept client requests, but they do not process them; instead they forward the requests to leader node B for processing.
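By way of illustration only, the leader-side flow of this two-phase commit can be outlined as follows; the helper names (send_prepare, send_commit, store_and_log) are assumptions used for the sketch, not the actual implementation:

```python
# Simplified sketch of the two-phase commit flow at leader node B (illustrative only).
def leader_handle_write(entry, followers, send_prepare, send_commit, store_and_log):
    # Phase 1: ask every follower whether it can execute the write.
    acks = 1  # the leader itself counts toward the majority
    for follower in followers:
        if send_prepare(follower, entry):   # follower answers Ack if it can store the data
            acks += 1
    total_nodes = len(followers) + 1
    if acks <= total_nodes // 2:            # not more than half: the write is rejected
        return False
    store_and_log(entry)                    # leader stores the data and records its log
    # Phase 2: tell the followers to actually store the data and record their logs.
    for follower in followers:
        send_commit(follower, entry)
    return True                             # a success response is returned to the client
```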
Traditionally, through majority (leader plus followers) data write acknowledgements and majority-based re-election of the master after node failures, the protocol layer ensures that data across the data copies remains strongly consistent and highly reliable as long as fewer than half of the nodes fail. However, this conventional technique generally has the following drawbacks:
(1) A master can be elected only when more than half of the nodes survive, so at least three machine rooms are required when the distributed database cluster is deployed across multiple data centers, and the cost is high;
(2) In a same-city dual-active deployment, if the number of nodes in the primary machine room A is greater than the number of nodes in the backup machine room B, then when machine room A fails the entire database cluster becomes unavailable and data may be lost.
Therefore, there is a need for a method that preserves strong consistency of database cluster data when half of the nodes fail, ensuring data integrity and zero data loss. In addition, the distributed database cluster is expected to be built as a single large dual-active cluster within the same city, so that multi-active, highly reliable operation of the distributed database cluster across machine rooms can be achieved without three machine rooms, greatly reducing cost while still ensuring high reliability and strong consistency of the database cluster data.
Disclosure of Invention
This summary is provided to introduce a selection of concepts that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
To address the high cost of the conventional distributed protocol and its unsuitability for common dual-active deployments, the present application provides a method and an apparatus for strongly consistent synchronization of distributed database nodes based on an external observer. While guaranteeing strong consistency across the nodes of the distributed database cluster, the application introduces scheduling nodes of an external observer group so that, even when half of the nodes in the cluster fail, strong data consistency and zero data loss during failover are still guaranteed. Based on the technical solution provided by the application, the distributed database cluster can be built as a single large dual-active cluster within the same city, so that multi-active, highly reliable operation across machine rooms can be achieved without three machine rooms, greatly reducing cost while still ensuring high reliability and strong consistency of the distributed database cluster's data.
The method and system support automatic election of a master node by the surviving half of the nodes, driven by the external observer group, when half of the nodes in the distributed database cluster fail, with zero data loss guaranteed, making them well suited to the common deployment of a single large same-city dual-active cluster.
According to one aspect of the embodiments of the application, a method for synchronizing distributed database cluster nodes based on an external observer group is provided. When the distributed database cluster fails, the external observer group performs a unified survival decision on the whole distributed database cluster; the external observer group drives voting among the surviving nodes in the distributed database cluster and a master node is elected, with the remaining surviving nodes becoming standby nodes; and the data of the master node is replicated to the standby nodes for synchronization.
Optionally, in one example of the above method, the survival decision includes determining whether a machine room or node in the distributed database cluster is available.
Optionally, in one example of the above method, the data synchronization of the master node to the standby node is implemented based on a semi-synchronous replication protocol.
Optionally, in one example of the above method, the data synchronization of the master node to the standby nodes is implemented based on an append entries (AppendEntries) Remote Procedure Call (RPC).
Optionally, in one example of the above method, the method further comprises the master node, in responding to an AppendEntries RPC request, waiting for a majority of nodes (counting the master node and the standby nodes) to complete the commit.
Optionally, in one example of the above method, the method further comprises optimizing the two-phase process of the AppendEntries RPC into a one-phase process.
Optionally, in one example of the above method, the optimizing includes omitting the second of the two phases of the AppendEntries RPC and committing locally at the standby node directly upon completion of the first phase.
Optionally, in one example of the above method, the method includes rolling back through a data flashback operation if the master node does not achieve a majority-consistent change.
According to another aspect of the embodiments of the application, an apparatus for node synchronization of a distributed database cluster based on an external observer group is provided, comprising: a unit for the external observer group to perform a unified survival decision on the whole database cluster when the distributed database cluster fails; a unit for voting, based on the external observer group, among the surviving nodes in the distributed database cluster and electing a master node, with the remaining surviving nodes becoming standby nodes; and a unit for replicating data of the master node to the standby nodes for synchronization.
Optionally, in one example of the above apparatus, the survival decision includes determining whether a machine room or node in the distributed database cluster is available.
Optionally, in one example of the above apparatus, the data synchronization of the master node to the standby node is implemented based on a semi-synchronous replication protocol.
Optionally, in one example of the above apparatus, the data synchronization of the master node to the standby nodes is implemented based on an append entries (AppendEntries) Remote Procedure Call (RPC).
Optionally, in one example of the above apparatus, the master node, in responding to an AppendEntries RPC request, waits for a majority of nodes (counting the master node and the standby nodes) to complete the commit.
Optionally, in one example of the above apparatus, the apparatus comprises means for optimizing the two-phase process of the AppendEntries RPC into a one-phase process.
Optionally, in one example of the above apparatus, the means for optimizing includes means for omitting the second of the two phases of the AppendEntries RPC and committing locally at the standby node directly upon completion of the first phase.
Optionally, in one example of the above apparatus, the apparatus includes means for rolling back through a data flashback operation if the master node does not achieve a majority-consistent change.
An electronic device according to an embodiment of the application includes a memory for storing executable instructions, and a processor coupled to the memory for causing the electronic device to perform the methods described herein when the executable instructions are executed.
A computer readable storage medium according to an embodiment of the present application has stored thereon a computer program, which, when executed by a computer, is capable of performing the method described herein.
A computer program product according to an embodiment of the application comprises computer instructions capable of performing the method described herein when executed by a computer.
It is noted that one or more of the aspects above include the features specifically pointed out in the following detailed description and the claims. The following description and the annexed drawings set forth in detail certain illustrative features of the one or more aspects. These features are indicative of but a few of the various ways in which the principles of various aspects may be employed and the present disclosure is intended to include all such aspects and their equivalents.
Detailed Description
The present disclosure will now be discussed with reference to various exemplary embodiments. It should be understood that the discussion of these embodiments is merely intended to enable one skilled in the art to better understand and thereby practice the examples of the present disclosure and is not intended to limit the scope of the present disclosure in any way.
Same-city dual-active deployment means building two machine rooms in the same city or nearby areas. The two machine rooms each carry part of the user traffic; each machine room deploys a distributed database cluster (e.g., a zookeeper cluster) and performs real-time synchronization. In a same-city dual-active deployment scenario, one of the machine rooms may fail; for example, the machine room's external network may be disconnected so that it cannot be reached from outside, or the machine room may suffer a complete power outage.
A distributed database cluster may have multiple node groups, and each node group needs to elect a master node when it fails. Since a split-brain survival decision must be made when half of the nodes fail, and this decision must be made for each of the node groups, which are tied to service data, manual decision-making and forced recovery with half of the nodes failed are highly complex and difficult to put into production.
When half of the nodes in a node group fail, a leader and followers cannot be elected by the traditional majority election mechanism. In the present application, a globally unified external observer group is introduced for the cluster; when half of the nodes in a node group fail, unified survival decisions and election voting are carried out through the external observer group, thereby achieving a quasi-majority election and guaranteeing zero data loss. Machine room failure decisions for the external observer group itself can be made manually; because the group is very lightweight, involves no business data, and there is only one external observer group, it is very easy to integrate into an operation and maintenance management platform and deploy in production.
Specific embodiments will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a flow chart 100 for synchronizing nodes in a distributed database cluster based on a scheduling node of an external observer group.
An equivalent implementation at the node synchronization level is described below with reference to the Raft protocol. For each node role in the Raft protocol, the leader corresponds to the master node 102 in Fig. 1, while the followers correspond to the standby nodes 103 and 104 in Fig. 1; the Raft protocol log is implemented by the node-level binlog in Fig. 1; and the Raft state machine is implemented at the innodb data table engine layer in Fig. 1.
First, the survival decision is a preliminary step before "electing a master". "Electing a master" means that, when the master node fails, the surviving candidate nodes vote for a new master node.
In the existing Raft protocol, Remote Procedure Calls (RPCs) are used for communication between nodes, and the basic consensus algorithm requires only two types of RPC. The two RPC calls of the Raft protocol are the request vote (RequestVote) RPC and the append entries (AppendEntries) RPC. In the Raft protocol, a candidate initiates RequestVote RPCs during an election to elect a master, and the leader then initiates AppendEntries RPCs for log replication and as a heartbeat mechanism.
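For background only, the arguments of these two RPCs as defined in the Raft paper can be sketched as follows; the field names follow the Raft paper and are not specific to the scheme claimed here:

```python
# Arguments of the two Raft RPCs, following the field names in the Raft paper (illustrative).
from dataclasses import dataclass, field
from typing import List

@dataclass
class RequestVoteArgs:
    term: int             # candidate's term
    candidate_id: str     # candidate requesting the vote
    last_log_index: int   # index of candidate's last log entry
    last_log_term: int    # term of candidate's last log entry

@dataclass
class AppendEntriesArgs:
    term: int                      # leader's term
    leader_id: str                 # so followers can redirect clients
    prev_log_index: int            # index of the log entry immediately preceding the new ones
    prev_log_term: int             # term of the prev_log_index entry
    entries: List[bytes] = field(default_factory=list)  # empty for heartbeats
    leader_commit: int = 0         # leader's commit index
```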
In one embodiment of the application, the "elect master" operation using the RequestVote RPC is based on the scheduling node 101 of the external observer group detecting the master-standby replication relationships of the back-end data nodes and generating RequestVote RPCs to cast votes; the candidate node with the most votes is selected as the master node 102. The more votes a candidate node receives, the more standby nodes are synchronizing data from that node, so that candidate node should be selected as the master node. The candidate node with the most standby nodes will obtain the most votes and become the master node.
According to one embodiment of the application, when half of the nodes are determined to be alive, the voting is driven by the external observer group, which drives each node to cast a vote for its master node by examining the replication relationships between the nodes. More specifically, the external observer can obtain status information through an interface provided by the candidate nodes and check the latest master-standby replication relationships among the candidate nodes before the current failure occurred. If the previous master-standby replication relationships cannot be determined, or the replication relationships are tied, the master is elected based on the progress of the Raft-like log: the candidate nodes vote for the node with the most up-to-date Raft log progress, and if the Raft log progress of all candidate nodes is equally up to date, one candidate node is selected as the master node by a random mechanism.
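By way of illustration only, this selection policy can be sketched as follows; the candidate status structure and field names are hypothetical, and the actual decision would depend on the replication metadata actually exposed by the nodes:

```python
import random

# Illustrative sketch of the observer-driven master selection policy described above.
def choose_master(candidates):
    """candidates: list of dicts with hypothetical keys 'name',
    'standby_count' (standby nodes replicating from it), 'log_progress' (Raft-like log position)."""
    # 1) Prefer the candidate with the most standby nodes replicating from it.
    max_standby = max(c["standby_count"] for c in candidates)
    best = [c for c in candidates if c["standby_count"] == max_standby]
    if len(best) == 1 and max_standby > 0:
        return best[0]["name"]
    # 2) Replication relationship unknown or tied: prefer the most up-to-date log progress.
    max_progress = max(c["log_progress"] for c in candidates)
    best = [c for c in candidates if c["log_progress"] == max_progress]
    if len(best) == 1:
        return best[0]["name"]
    # 3) All remaining candidates equally up to date: pick one at random.
    return random.choice(best)["name"]
```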
The current term is marked by setting local master version information. During a term, the master node remains the master node until it fails and becomes unavailable; until then, the "elect master" process is not restarted.
According to one embodiment of the present application, the AppendEntries RPC is used to synchronize replicated changes from the master data node to the standby data nodes. For example, synchronization of data between the master node and the standby nodes may be achieved based on a semi-synchronous replication protocol.
According to another embodiment of the application, the two-phase process of the AppendEntries RPC may be optimized into a one-phase process from an engineering implementation perspective. That is, the second of the two phases of the AppendEntries RPC is omitted, and the standby node commits locally directly upon completion of the first phase.
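By way of illustration only, the one-phase variant on the standby side can be sketched as follows; the helper names are assumptions made for the sketch and not the actual implementation:

```python
# Illustrative sketch of the one-phase AppendEntries optimization on a standby node.
def standby_handle_append(entry, local_log, apply_entry):
    """First phase only: the standby appends and commits locally, then acks the master.
    There is no second-phase confirmation round; a later data comparison with the master
    may roll this commit back via a data flashback operation (see below)."""
    local_log.append(entry)   # record the change in the standby's log (e.g. binlog)
    apply_entry(entry)        # commit locally right away instead of waiting for phase two
    return True               # Ack returned to the master node
```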
According to yet another embodiment of the present application, if the master node 102 does not achieve a majority-consistent change (counting master and standby nodes), the change will be rolled back through a data flashback operation. More specifically, if content committed locally by a standby node does not ultimately become a majority-consistent change at the master node, it is subsequently rolled back by a data flashback operation during data comparison with the master node. For example, based on the operation log of the local standby node, the operations of the transaction to be rolled back are inverted and applied in reverse order, undoing the changes that have occurred. For example, if a transaction is the sequence "insert 1 > update 3 to 2 > delete 4", then the data flashback operation performed is "insert 4 > update 2 to 3 > delete 1".
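By way of illustration only, the flashback can be sketched as computing inverse operations and replaying them in reverse order; the operation-record format below is hypothetical:

```python
# Illustrative sketch of the data flashback operation: invert each operation and replay in reverse.
def flashback(operations):
    """operations: list of ('insert', value), ('delete', value) or ('update', old, new) records."""
    inverse = []
    for op in reversed(operations):               # reverse order
        if op[0] == "insert":
            inverse.append(("delete", op[1]))     # undo an insert by deleting
        elif op[0] == "delete":
            inverse.append(("insert", op[1]))     # undo a delete by re-inserting
        elif op[0] == "update":
            _, old, new = op
            inverse.append(("update", new, old))  # undo an update by changing the value back
    return inverse

# Matches the example above: insert 1 > update 3 to 2 > delete 4
# flashes back to: insert 4 > update 2 to 3 > delete 1.
assert flashback([("insert", 1), ("update", 3, 2), ("delete", 4)]) == \
       [("insert", 4), ("update", 2, 3), ("delete", 1)]
```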
In responding to an AppendEntries RPC request, the master node waits for a majority of nodes (counting the master and standby nodes) to complete the commit. Completing the commit here means determining that the transaction is complete and persisted. For example, a common way to implement persistence is journaling.
The master node's majority view (view) is maintained as nodes are added or removed, and the majority count that each commit must wait for is automatically updated and maintained accordingly.
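By way of illustration only, maintaining the majority count as the view changes can be sketched as follows, assuming the view is simply the current set of master and standby nodes:

```python
# Illustrative sketch: the commit quorum tracks the current majority view automatically.
class MajorityView:
    def __init__(self, nodes):
        self.nodes = set(nodes)            # master plus standby nodes in the current view

    def quorum(self):
        return len(self.nodes) // 2 + 1    # "more than half" of the current view

    def add_node(self, node):
        self.nodes.add(node)               # quorum for the next commit grows automatically

    def remove_node(self, node):
        self.nodes.discard(node)           # quorum for the next commit shrinks automatically

    def commit_ok(self, acked_nodes):
        return len(self.nodes & set(acked_nodes)) >= self.quorum()

view = MajorityView(["B", "A", "C", "D"])   # 4 nodes: each commit must wait for 3 acks
assert view.quorum() == 3
view.remove_node("D")                       # 3 nodes remain: quorum drops to 2
assert view.quorum() == 2
```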
Fig. 2 illustrates a block diagram 200 of cross-machine-room split-brain determination and synchronization based on an external observer group in a same-city dual-active deployment.
In one embodiment of the application, in a same-city dual-active deployment the node group for each piece of data contains at least 4 nodes, with the same number of nodes in each of the two machine rooms. Above the nodes is an external observer group 202. The external observer group 202 includes one or more scheduling nodes 204-1, 204-2, ..., 204-n. The scheduling nodes of the external observer group are evenly distributed across the two machine rooms and are connected to all the nodes, so that the scheduling nodes can participate in the split-brain survival decision for the nodes.
According to one embodiment of the present application, in a same-city dual-active deployment, the conventional determination over numerous raft groups is simplified to a machine-room split-brain survival determination based on the zookeeper cluster 206 of the external observer group 202. Specifically, the scheduling nodes 204-1, 204-2, ..., 204-n of the external observer group 202 are themselves stateless; cross-machine-room split-brain decisions and metadata synchronization are handled by the zookeeper cluster 206. The zookeeper cluster 206 is responsible for the cross-machine-room split-brain determination for all scheduling nodes. For example, a scheduling node that can communicate with the zookeeper cluster 206 is determined to be alive, while a scheduling node that cannot communicate with the zookeeper cluster 206 will automatically exit. The machine room where a surviving scheduling node is located is therefore a surviving machine room, thereby achieving cross-machine-room split-brain determination.
According to one embodiment of the application, the zookeeper cluster deploys an odd number of nodes, and the number of zookeeper nodes in the primary machine room is greater than that in the backup machine room. If the connection between the primary and backup machine rooms fails but neither machine room itself fails, the primary machine room is treated as alive by default as long as it is in fact alive. If the backup machine room fails, the primary machine room can directly complete failover. If the primary machine room fails, the zookeeper nodes of the backup machine room become unavailable, but a reduced new zookeeper cluster can be forcefully restored through a few simple steps. Since zookeeper carries no business data and only stores temporary running-state data, and is very lightweight, this process is very fast and has no effect on the consistency of business data.
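One common way to realize the "alive means connected to zookeeper" rule above is an ephemeral znode per scheduling node. The sketch below uses the kazoo client library for illustration only; the znode path, hosts, and exit behaviour are assumptions, not the claimed implementation:

```python
# Illustrative sketch: a scheduling node registers an ephemeral znode and exits if it
# loses its zookeeper session (i.e. it can no longer talk to the zookeeper cluster).
import sys
from kazoo.client import KazooClient
from kazoo.protocol.states import KazooState

def run_scheduler(node_id, zk_hosts="zk1:2181,zk2:2181,zk3:2181"):
    zk = KazooClient(hosts=zk_hosts)

    def on_state_change(state):
        if state == KazooState.LOST:   # session lost: this machine room may be partitioned off
            sys.exit(1)                # automatically exit, per the rule described above

    zk.add_listener(on_state_change)
    zk.start()
    # The ephemeral znode disappears automatically if this scheduling node dies or is cut off.
    zk.create(f"/observers/{node_id}", b"alive", ephemeral=True, makepath=True)
    return zk
```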
Next, the scheduling node determined to be alive automatically participates in the "elect master" process of the nodes it can communicate with.
As described above, when the two machine rooms contain the same number of nodes, if one machine room fails, the surviving machine room contains half of the total number of nodes. Once the scheduling nodes of the external observer group participate in the election process, the surviving half of the nodes can be activated to vote, realizing a quasi-majority election: the external observer group casts an additional vote for the node with the most votes, so that a master can be elected even when half of the nodes have failed.
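By way of illustration only, the arithmetic of this quasi-majority election can be sketched as follows: with 2k data nodes in total, the k surviving votes plus one observer vote reach the k + 1 votes that a strict majority of 2k requires (names and structure are hypothetical):

```python
# Illustrative sketch of the quasi-majority election with an extra observer vote.
def quasi_majority_elect(surviving_votes, total_nodes):
    """surviving_votes maps each surviving node to the candidate it voted for."""
    tally = {}
    for candidate in surviving_votes.values():
        tally[candidate] = tally.get(candidate, 0) + 1
    top = max(tally, key=tally.get)
    tally[top] += 1                    # the external observer group adds one vote for the top node
    if tally[top] > total_nodes // 2:  # strict majority of the full cluster is still required
        return top
    return None

# 4-node group with one machine room (2 nodes) down: the 2 survivors plus the observer
# vote give candidate "A" 3 votes, 3 > 4 // 2, so a master can still be elected.
assert quasi_majority_elect({"A": "A", "B": "A"}, total_nodes=4) == "A"
```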
Therefore, the nodes are driven to elect a master through the votes of the external observer group, while data writes still strictly require acknowledgement by more than half of the nodes, and each write is synchronized to at least one node in each of the two machine rooms, so that the synchronized data exists in both machine rooms. This achieves zero loss of service data after re-election of the master and ensures strong consistency of the data.
The application allows single-cluster deployment for same-city dual-active distributed database clusters, ensuring that the cluster can be quickly recovered with zero data loss when half of the nodes fail, while still guaranteeing majority writes. The method simplifies split-brain determination in the cluster's same-city dual-active scenario based on the globally unified external observer group; regardless of cluster size, this determination only needs to be carried out once and does not involve the user data replication groups, greatly enhancing operational data security.
Fig. 3 illustrates a flow chart 300 of a method of survival decision, voting, and synchronization for distributed database clusters based on external observer groups.
At block 302, when the distributed database cluster fails, the external observer group makes a unified survival decision for the entire distributed database cluster. The survival decision is used to determine whether a node or a machine room is available. For example, in a same-city dual-active deployment, after one of the two machine rooms fails, it is determined whether the other machine room survives. Because a split-brain scenario can occur between the two machine rooms, when machine room 1 cannot reach machine room 2 it is necessary to confirm whether machine room 1 itself is alive or whether machine room 2 is the one that has failed.
At block 304, the surviving nodes (i.e., candidates) in the distributed database cluster vote, driven by the external observer group, and a master node (i.e., leader) is elected. The remaining surviving nodes become standby nodes (i.e., followers).
At block 306, the data of the master node is replicated to the standby nodes for synchronization. Optionally, data synchronization from the master node to the standby nodes may be implemented based on a semi-synchronous replication protocol.
Fig. 4 illustrates an apparatus 400 for survival decision, voting, and synchronization of distributed database clusters based on external observer groups.
The apparatus 400 includes a unit 402 for the external observer group to make a unified survival decision for the entire distributed database cluster when the distributed database cluster fails. The survival decision is used to determine whether a node or a machine room is available. For example, in a same-city dual-active deployment, after one of the two machine rooms fails, it is determined whether the other machine room survives. Because a split-brain scenario can occur between the two machine rooms, when machine room 1 cannot reach machine room 2 it is necessary to confirm whether machine room 1 itself is alive or whether machine room 2 is the one that has failed.
The apparatus 400 includes a unit 404 for the surviving nodes (i.e., candidates) in the distributed database cluster to vote, driven by the external observer group, and for electing a master node (i.e., leader), with the remaining surviving nodes becoming standby nodes (i.e., followers).
The apparatus 400 includes a unit 406 for replicating data of the master node to the standby nodes for synchronization. Optionally, the unit 406 may include a unit for implementing data synchronization from the master node to the standby nodes based on a semi-synchronous replication protocol.
Fig. 5 illustrates an electronic device 500 that can perform the methods described herein. The electronic device 500 shown in Fig. 5 may be implemented in software, hardware, or a combination of software and hardware.
As shown in Fig. 5, the electronic device 500 may include a processor 501 and a memory 502, where the memory 502 stores executable instructions that, when executed, cause the processor 501 to perform the methods described herein.
Embodiments of the present application also provide a computer-readable storage medium having stored thereon a computer program which, when executed, causes a computer to perform the methods described herein.
Embodiments of the present application also provide a computer program product comprising computer instructions which, when executed, cause a computer to perform the methods described herein.
It should be understood that all operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or the order of such operations, but rather should cover all other equivalent variations under the same or similar concepts.
The processor has been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and the overall design constraints imposed on the system. As an example, a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, microcontroller, digital signal processor (DSP), field-programmable gate array (FPGA), programmable logic device (PLD), state machine, gate logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described in this disclosure. The functions of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as software that is executed by a microprocessor, microcontroller, DSP, or other suitable platform.
The previous description is provided to enable any person skilled in the art to practice the various aspects of the application described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean "one and only one" unless specifically so stated, but rather "one or more". The term "some" means one or more unless specifically stated otherwise. All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims.
It will be appreciated by persons skilled in the art that various modifications and variations may be made to the above disclosed embodiments without departing from the spirit of the application, which modifications and variations are to be considered within the scope of the application, which is to be defined by the appended claims.