CN113127565B - Method and device for synchronizing distributed database nodes based on external observer group - Google Patents

Method and device for synchronizing distributed database nodes based on external observer group

Info

Publication number
CN113127565B
CN113127565B
Authority
CN
China
Prior art keywords
node
nodes
distributed database
database cluster
external observer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110467944.XA
Other languages
Chinese (zh)
Other versions
CN113127565A (en)
Inventor
李韩
邹西山
林金怡
吴伟华
文其瑞
高孝鑫
司同
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Great Opensource Software Co ltd
China Unicom WO Music and Culture Co Ltd
Original Assignee
Beijing Great Opensource Software Co ltd
China Unicom WO Music and Culture Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Great Opensource Software Co ltd and China Unicom WO Music and Culture Co Ltd
Priority to CN202110467944.XA
Publication of CN113127565A
Application granted
Publication of CN113127565B
Status: Active
Anticipated expiration


Abstract


The present application provides a method for synchronizing distributed database nodes based on an external observer group. The method includes: when a distributed database cluster fails, the external observer group makes a unified survival decision for the entire database cluster; based on the external observer group, the surviving nodes in the distributed database cluster vote to elect a master node, with the remaining surviving nodes serving as standby nodes; and the data of the master node is copied to the standby nodes for synchronization. The present application enables a master node to still be elected, through the external observer group, when only half of the nodes survive.

Description

Method and device for node synchronization of distributed database based on external observer group
Technical Field
The present application relates to distributed databases, and more particularly, to a method and apparatus for distributed database node synchronization based on an external observer group.
Background
High data reliability in a distributed database is usually achieved through redundant data copies: by storing service data in multiple copies simultaneously, the distributed database cluster remains available and the data remains safe even when the primary data copy fails.
The mainstream strong-synchronization protocols between nodes are majority-based distributed consensus protocols, such as the Paxos protocol and the Raft protocol.
The Raft protocol is a distributed consensus protocol in relatively wide use today. In the Raft protocol, each node has a state, and the state machine is deterministic: different states correspond to different series of operations. A synchronous-replication function is added on top of the state machine, and during synchronous replication each node saves the position and order of replication in a cache or on disk. Briefly, multiple nodes start from the same initial state, execute the same series of commands, and produce the same final state.
There are three node states in the Raft protocol: candidate, leader, and follower. A candidate is a node standing for election as master and may become either a leader or a follower. The leader is the elected master node; it mainly handles communication and interaction with clients, and also maintains data synchronization and a heartbeat mechanism with the followers. A follower is a standby node; it is mainly responsible for forwarding client requests to the master node and then keeping its data synchronized with the master.
The Raft protocol employs an "over-half election" mechanism to elect the leader and followers. Taking three nodes (N=3) as an example: first, the three nodes enter the candidate state and each creates a term; then the three nodes vote within that term. Under the over-half-election mechanism, a node can become leader only if its vote count is greater than N/2. The node that becomes leader sends heartbeats to the other nodes, and those nodes transition from candidates to followers. Over-half election mechanisms are also employed elsewhere, for example in zookeeper clusters, and are not described in detail here.
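The over-half rule above can be sketched as follows (a minimal illustration, not the patent's implementation; the function name is hypothetical):

```python
def can_become_leader(votes_received: int, total_nodes: int) -> bool:
    """A candidate wins the election only if its vote count is
    strictly greater than half of the total node count N."""
    return votes_received > total_nodes / 2
```

With N = 3, two votes suffice (2 > 1.5); with N = 4, two votes are exactly half and therefore do not elect a leader — which is precisely the half-failure case the patent's external observer group is designed to resolve.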
Assume node B is the leader and nodes A and C are followers. Clients send data to node B, and on receiving the instruction node B begins data storage and data synchronization. This is not a direct store; instead, a two-phase commit is used. The two-phase commit proceeds as follows. In the first phase, leader node B receives the client data request and sends requests to follower nodes A and C; nodes A and C receive the request, confirm that they can execute the data store, and respond to node B with an Ack. After receiving the Ack responses, node B checks whether more than half of the nodes returned an executable Ack; if so, it stores the data, records the log, and returns a success response to the client. In the second phase, leader node B again sends a request to follower nodes A and C confirming execution, and nodes A and C store the data and record the log. Note that follower nodes A and C also accept client requests, but they do not process them; instead, they forward the requests to leader node B for processing.
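The two-phase commit flow above can be sketched in a few lines (a simplified in-memory model under stated assumptions — no networking, persistence, or failure handling; all class and method names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Follower:
    store: list = field(default_factory=list)     # committed entries
    prepared: list = field(default_factory=list)  # phase-1 acknowledged entries

    def prepare(self, entry) -> bool:
        # Phase 1: confirm this follower can execute the data store.
        self.prepared.append(entry)
        return True  # Ack

    def commit(self, entry) -> None:
        # Phase 2: actually store the data and record the log.
        self.store.append(entry)

class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.store = []

    def handle_client_write(self, entry) -> bool:
        # Phase 1: request an Ack from every follower.
        acks = sum(1 for f in self.followers if f.prepare(entry))
        total = len(self.followers) + 1   # followers plus the leader itself
        if acks + 1 <= total / 2:         # need strictly more than half
            return False                  # no majority: reject the write
        self.store.append(entry)          # leader stores data, records its log
        # Phase 2: tell every follower to confirm execution and commit.
        for f in self.followers:
            f.commit(entry)
        return True                       # success response to the client
```

In a real implementation each `prepare`/`commit` would be an RPC, and a follower that receives a client request would forward it to the leader rather than handling it.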
Traditionally, through majority (including leader and followers) data-write acknowledgements and majority-based master election on node failure, the protocol layer ensures that data between copies is strongly consistent and highly reliable as long as fewer than half of the nodes fail. However, this conventional technique generally has the following drawbacks:
(1) A master can be elected only when more than half of the nodes survive, so at least 3 machine rooms are required when a distributed database cluster is deployed across multiple machine-room centers, which is costly;
(2) In a same-city dual-active deployment, if the number of nodes in the main machine room A is greater than the number in the backup machine room B, the entire database cluster becomes unavailable when machine room A fails, and data may be lost.
Therefore, there is a need for a method that preserves strong consistency of the database cluster data when half of the nodes fail, ensuring a high degree of data integrity and zero data loss. In addition, it is desirable for the distributed database cluster to form a dual-active single large cluster within the same city, so that cross-machine-room multi-active high reliability can be achieved without 3 machine rooms, greatly reducing cost while still guaranteeing high reliability and strong consistency of the database cluster data.
Disclosure of Invention
This summary is provided to introduce a selection of concepts that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Aiming at the problems of the traditional distributed protocols — high cost and unsuitability for common dual-active deployment — the present application provides a method and an apparatus for strongly consistent synchronization of distributed database nodes based on an external observer group. While guaranteeing strong consistency across the nodes of the distributed database cluster, the application introduces the scheduling nodes of an external observer group so that, even when half of the nodes fail, the cluster still guarantees strong data consistency and zero data loss on failover. Based on this technical solution, the distributed database cluster can form a dual-active single large cluster within the same city, achieving cross-machine-room multi-active high reliability without three machine rooms, greatly reducing cost while still guaranteeing high reliability and strong consistency of the cluster's data.
The method and system support automatic election of a master node by the votes of the surviving half of the nodes, driven by the external observer group, when half of the nodes in the distributed database cluster fail, while guaranteeing zero data loss; this makes them well suited to common same-city dual-active single-large-cluster deployments.
According to one aspect of the embodiments of the application, a method for synchronizing distributed database cluster nodes based on an external observer group is provided, in which: when the distributed database cluster fails, the external observer group makes a unified survival decision for the whole distributed database cluster; the external observer group drives a vote among the surviving nodes in the cluster, electing a master node, with the remaining surviving nodes as standby nodes; and the data of the master node is copied to the standby nodes for synchronization.
Optionally, in one example of the above method, the survival decision includes determining whether a machine room or node in the distributed database cluster is available.
Optionally, in one example of the above method, the data synchronization of the master node to the standby node is implemented based on a semi-synchronous replication protocol.
Optionally, in one example of the above method, the data synchronization from the primary node to the backup node is implemented based on an AppendEntries remote procedure call (RPC).
Optionally, in one example of the above method, the method further comprises the master node waiting for a majority of nodes (including the master and standby nodes) to complete the commit before responding to the AppendEntries RPC request.
Optionally, in one example of the above method, the method further comprises optimizing the two-phase process of the AppendEntries RPC into a one-phase process.
Optionally, in one example of the above method, the optimizing includes omitting the second of the two AppendEntries RPC phases and committing locally at the standby node directly upon completion of the first phase.
Optionally, in one example of the above method, the method includes rolling back via a data flashback operation if the primary node does not achieve a majority-consistent change across its standby nodes.
According to another aspect of the embodiments of the application, an apparatus for node synchronization of a distributed database cluster based on an external observer group is provided, comprising: a unit for the external observer group to make a unified survival decision for the whole database cluster when the distributed database cluster fails; a unit for voting among the surviving nodes in the distributed database cluster based on the external observer group and electing a master node, with the remaining surviving nodes as standby nodes; and a unit for copying the data of the master node to the standby nodes for synchronization.
Optionally, in one example of the above apparatus, the survival decision includes determining whether a machine room or node in the distributed database cluster is available.
Optionally, in one example of the above apparatus, the data synchronization of the master node to the standby node is implemented based on a semi-synchronous replication protocol.
Optionally, in one example of the above apparatus, the data synchronization from the primary node to the backup node is implemented based on an AppendEntries remote procedure call (RPC).
Optionally, in one example of the above apparatus, the master node waits for a majority of nodes (including the master and standby nodes) to complete the commit before responding to the AppendEntries RPC request.
Optionally, in one example of the above apparatus, the apparatus further comprises means for optimizing the two-phase process of the AppendEntries RPC into a one-phase process.
Optionally, in one example of the above apparatus, the means for optimizing includes means for omitting the second of the two AppendEntries RPC phases and committing locally at the standby node directly upon completion of the first phase.
Optionally, in one example of the above apparatus, the apparatus includes means for rolling back via a data flashback operation if the primary node does not achieve a majority-consistent change across its standby nodes.
An electronic device according to an embodiment of the application includes a memory for storing executable instructions, and a processor coupled to the memory for causing the electronic device to perform the methods described herein when the executable instructions are executed.
A computer readable storage medium according to an embodiment of the present application has stored thereon a computer program, which, when executed by a computer, is capable of performing the method described herein.
A computer program product according to an embodiment of the application comprises computer instructions capable of performing the method described herein when executed by a computer.
It is noted that one or more of the aspects above include the features specifically pointed out in the following detailed description and the claims. The following description and the annexed drawings set forth in detail certain illustrative features of the one or more aspects. These features are indicative of but a few of the various ways in which the principles of various aspects may be employed and the present disclosure is intended to include all such aspects and their equivalents.
Drawings
The disclosed aspects will be described below in conjunction with the drawings, which are provided to illustrate and not limit the disclosed aspects.
Fig. 1 illustrates a flow chart 100 for synchronizing nodes in a distributed database cluster based on a scheduling node of an external observer group.
Fig. 2 illustrates a block diagram 200 of cross-machine room split determination and synchronization based on an external observer group at the same city dual activity deployment.
FIG. 3 illustrates a flow chart of a method of survival decision, voting, and synchronization for distributed database clusters based on external observer groups.
Fig. 4 illustrates an apparatus 400 for survival decision, voting, and synchronization of distributed database clusters based on external observer groups.
Fig. 5 illustrates an electronic device 500 on which the methods described herein may be implemented.
Detailed Description
The present disclosure will now be discussed with reference to various exemplary embodiments. It should be understood that the discussion of these embodiments is merely intended to enable one skilled in the art to better understand and thereby practice the examples of the present disclosure and is not intended to limit the scope of the present disclosure in any way.
Same-city dual-active deployment means building two machine rooms in the same city or in nearby areas. The two rooms each carry a portion of the user traffic; each room deploys a distributed database cluster (e.g., a zookeeper cluster) and performs real-time synchronization. In a same-city dual-active scenario, one of the machine rooms may fail; for example, the room's external network is disconnected so that it cannot be accessed from outside, or the room loses power entirely.
A distributed database cluster may have multiple node groups, and each node group must elect a master node when it fails. Because a split-brain survival decision is needed when half of the nodes fail, that decision would have to be made for every node group, and node groups carry service data; consequently, manual decision-making and forced recovery for half of the nodes is highly complex and difficult to put into production.
When half of the nodes in a node group fail, a leader and followers cannot be elected by the traditional over-half election mechanism. According to the application, by introducing a cluster-wide, globally unified external observer group, unified survival decisions and election voting are carried out through the external observer group when half of the nodes in a node group fail, thereby achieving a quasi-over-half election while guaranteeing zero data loss. The machine-room fault decision for the external observer group itself can involve manual intervention; because the group is very lightweight, involves no business data, and there is only one such group, it is very easy to integrate into an operations-and-maintenance management platform and deploy in production.
Specific embodiments will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a flow chart 100 for synchronizing nodes in a distributed database cluster based on a scheduling node of an external observer group.
The corresponding implementation at the node-synchronization level is described below with reference to the Raft protocol. Mapping the Raft node roles onto Fig. 1: the leader corresponds to the master node 102, while the followers correspond to the standby nodes 103 and 104; the Raft protocol log is implemented by the per-node binlog in Fig. 1, and the Raft state machine is implemented at the data-table engine (innodb) layer in Fig. 1.
First, the survival decision is a preliminary step before "electing a master". "Electing a master" means that, when the master node fails, the surviving candidate nodes vote for a new master node.
In the existing Raft protocol, remote procedure calls (RPCs) are used for communication between nodes, and the basic consensus algorithm requires only two types of RPC. The two RPC types of the Raft protocol are the RequestVote RPC and the AppendEntries RPC. In the Raft protocol, a candidate initiates RequestVote RPCs during an election to "elect a master", and the leader then initiates AppendEntries RPCs for log replication and as a heartbeat mechanism.
In one embodiment of the application, the "elect master" operation using the RequestVote RPC works as follows: the scheduling node 101 of the external observer group detects the master-standby replication relationships of the back-end data nodes and generates RequestVote RPC votes, and the candidate with the most votes is elected as master node 102. The more votes a candidate node receives, the more standby nodes synchronize data from it, so that candidate should be selected as master; the node with the most standby nodes thus obtains the most votes and becomes the master node.
According to one embodiment of the application, when half of the nodes are determined to survive, the voting is driven by the external observer group, which drives each node to cast its vote for its master by examining the replication relationships between the nodes. More specifically, the external observer obtains status information through an interface provided by the candidate nodes and checks the most recent master-standby replication relationship among the candidates before the current fault occurred. If the previous master-standby replication relationship cannot be determined, or the relationships turn out to be equal, the "elect master" step is performed based on the progress of the Raft-like log: the candidate nodes vote for the node with the most up-to-date Raft log progress, and if the Raft log progress of all candidates is equally up to date, one candidate is selected as master through a random mechanism.
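The decision cascade just described — prior replication relationship first, then freshest log progress, then a random tiebreak — can be sketched as follows (a simplified model, not the patent's implementation; the function name and progress representation are assumptions):

```python
import random
from typing import Dict, Optional

def elect_master(candidates: Dict[str, int], prior_master: Optional[str]) -> str:
    """candidates maps node name -> Raft-like log progress (larger = fresher).
    prior_master is the primary from the last known master-standby replication
    relationship before the fault, or None if it is unknown or the
    relationships were equal."""
    if prior_master is not None and prior_master in candidates:
        return prior_master                 # keep the last known primary
    best = max(candidates.values())
    freshest = [n for n, p in candidates.items() if p == best]
    if len(freshest) == 1:
        return freshest[0]                  # unique freshest log wins
    return random.choice(freshest)          # all equally fresh: random pick
```

For instance, if the prior relationship is unknown and node "b" has the freshest log, "b" is elected; if the prior primary "a" still survives, it is retained.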
The current term is marked by setting local master-version information. During a term, the elected node remains the master until it fails and becomes unavailable; the "elect master" process is not restarted within the term.
In accordance with one embodiment of the present application, the AppendEntries RPC is used to synchronously replicate changes from the primary data node to the standby data nodes. For example, data synchronization between the primary and standby nodes may be achieved based on a semi-synchronous replication protocol.
According to another embodiment of the application, the two-phase process of the AppendEntries RPC may be optimized, from an engineering-implementation perspective, into a one-phase process. That is, the second of the two AppendEntries RPC phases is omitted, and the standby node commits locally directly upon completion of the first phase.
According to yet another embodiment of the present application, if the master node 102 does not achieve a majority-consistent change (across the master and standby nodes), the change will be rolled back through a data flashback operation. More specifically, if content committed locally by a standby node ultimately fails to become a majority-consistent change at the primary node, it is subsequently rolled back by a data flashback operation during data comparison with the primary node. That is, based on the operation log of the local standby node, the operations of the transaction to be rolled back are applied in reverse order as inverse operations, undoing the changes that were applied. For example, if a transaction's operations in order are: insert 1, update 3 to 2, delete 4; then the data flashback operations performed are: insert 4, update 2 back to 3, delete 1.
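The reverse-order inverse-operation rule can be made concrete as follows (a minimal sketch over a toy operation-log format; the tuple encoding is an assumption, not the patent's binlog format):

```python
def flashback(ops):
    """Produce the inverse operations, in reverse order, for a transaction's
    operation log. Each op is ('insert', v), ('delete', v), or
    ('update', old, new)."""
    inverse = []
    for op in reversed(ops):
        if op[0] == "insert":
            inverse.append(("delete", op[1]))        # undo an insert
        elif op[0] == "delete":
            inverse.append(("insert", op[1]))        # undo a delete
        else:  # ('update', old, new): change new back to old
            inverse.append(("update", op[2], op[1]))
    return inverse
```

Applied to the example in the text — insert 1, update 3 to 2, delete 4 — this yields: insert 4, update 2 back to 3, delete 1.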
The master node waits for a majority of nodes (including the master and standby nodes) to complete the commit before responding to the AppendEntries RPC request. Completing the commit here means determining that the transaction is complete and persisted; a common way to implement persistence is journaling.
The master node's majority view is maintained as nodes are added and removed, and the number of majority members to wait for on each commit is adjusted and maintained automatically.
Fig. 2 illustrates a block diagram 200 of cross-machine room split determination and synchronization based on an external observer group at the same city dual activity deployment.
In one embodiment of the application, in a same-city dual-active deployment the node group for each piece of data contains at least 4 nodes, with the same number of nodes in each of the two rooms. Above the nodes is an external observer group 202. The external observer group 202 includes one or more scheduling nodes 204-1, 204-2, ..., 204-n. The scheduling nodes of the external observer group are evenly distributed across the 2 machine rooms and are connected to all nodes, so that they can participate in the nodes' split-brain survival decisions.
According to one embodiment of the present application, in a same-city dual-active deployment, the conventional survival determination over numerous Raft groups is simplified to a machine-room split-brain survival determination based on the zookeeper cluster 206 of the external observer group 202. Specifically, the scheduling nodes 204-1, 204-2, ..., 204-n of the external observer group 202 are themselves stateless, with cross-machine-room split-brain decisions and metadata synchronization handled by the zookeeper cluster 206. The zookeeper cluster 206 is responsible for the cross-machine-room split-brain determination for all scheduling nodes: a scheduling node that can communicate with the zookeeper cluster 206 is determined to be alive, and a scheduling node that cannot will automatically exit. The machine room hosting surviving scheduling nodes is therefore a surviving machine room, which realizes the cross-machine-room split-brain determination.
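The room-liveness rule — a scheduling node is alive exactly when it can reach the zookeeper cluster, and a room is alive when it hosts a live scheduling node — can be sketched as follows (a toy model with assumed names; real connectivity checks would go through zookeeper sessions):

```python
def surviving_rooms(node_room, reaches_zookeeper):
    """node_room: scheduling node -> machine room it runs in.
    reaches_zookeeper: scheduling node -> whether it can currently talk to
    the zookeeper cluster. A node that cannot reach the cluster exits, and
    the rooms that still host live scheduling nodes are judged alive."""
    alive = {n for n, ok in reaches_zookeeper.items() if ok}
    return {room for n, room in node_room.items() if n in alive}
```

For example, if scheduling node s1 in room A keeps its zookeeper session while s2 in room B loses it, only room A is judged to survive.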
According to one embodiment of the application, the zookeeper cluster deploys an odd number of nodes, with more zookeeper nodes in the main machine room than in the backup machine room. If the connection between the main room and the backup room fails but neither room itself has failed, the main room is considered alive by default as long as it survives. If the backup room fails, the main room can complete failover directly. If the main room fails, the zookeeper nodes of the backup room become unavailable, but they can be forcibly restored into a reduced new zookeeper cluster through simple steps. Since zookeeper holds no business data and stores only temporary running-state data, and is very lightweight, this process is very fast and has no effect on the consistency of business data.
Next, the scheduling nodes determined to be surviving automatically participate in the "elect master" process of the nodes they can communicate with.
As described above, when the two rooms contain the same number of nodes, if one room fails, the surviving room contains exactly half of the total node count. Once the scheduling nodes of the external observer group join the election process, the surviving half of the nodes can be activated to vote, with the external observer group casting an additional vote for the node with the most votes; this realizes the quasi-over-half election and allows a master to be elected even when half of the nodes have failed.
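The observer's extra ballot can be sketched as follows (a minimal illustration under stated assumptions — the function name is hypothetical, and the front-runner is taken as the candidate with the most surviving-node votes):

```python
from typing import Dict, Optional

def elect_with_observer(votes: Dict[str, int], total_nodes: int) -> Optional[str]:
    """votes maps candidate -> votes cast by surviving nodes. The observer
    group adds one extra vote for the current front-runner, so a majority
    (> total_nodes / 2) is reachable even when only half the nodes survive."""
    if not votes:
        return None
    front = max(votes, key=lambda c: votes[c])   # node with the most votes
    if votes[front] + 1 > total_nodes / 2:       # observer's extra ballot
        return front
    return None
```

In a 4-node cluster where one room fails, the 2 surviving nodes can cast at most 2 votes — exactly half, so no leader under plain Raft; with the observer's extra vote the front-runner reaches 3 > 2 and is elected.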
Thus the nodes are driven to elect a master through the votes of the external observer group. Data writes still strictly require acknowledgement from more than half of the nodes, and each write is synchronized to at least one node in each of the two machine rooms, so the synchronized data exists in both rooms; this achieves zero loss of service data after re-election and guarantees strong consistency of the data.
The application allows single-cluster deployment for same-city dual-active distributed database clusters, and ensures that the cluster can recover quickly with zero data loss when half of the nodes fail, while still guaranteeing majority writes. It simplifies the split-brain determination for the cluster's same-city dual-active scenario through the globally unified external observer group; regardless of cluster size, this process needs to be performed only once and does not involve the user-data replication groups, which greatly strengthens operational data safety.
Fig. 3 illustrates a flow chart 300 of a method of survival decision, voting, and synchronization for distributed database clusters based on external observer groups.
At block 302, when the distributed database cluster fails, the external observer group makes a unified survival decision for the entire distributed database cluster. The survival decision determines whether a node or a machine room is available. For example, in a same-city dual-active deployment, after one of the two rooms fails, it must be determined whether the other room survives: because a split-brain may occur between the two rooms, when machine room 1 cannot reach machine room 2 it is necessary to confirm whether machine room 1 is itself alive and whether machine room 2 has actually failed.
At block 304, the surviving nodes (i.e., candidates) in the distributed database cluster vote, driven by the external observer group, to elect a master node (i.e., leader). The remaining surviving nodes become standby nodes (i.e., followers).
At block 306, the data of the master node is copied to the standby nodes for synchronization. Optionally, data synchronization from the master node to the standby nodes may be implemented based on a semi-synchronous replication protocol.
Fig. 4 illustrates an apparatus 400 for survival decision, voting, and synchronization of distributed database clusters based on external observer groups.
The apparatus 400 includes a unit 402 for the external observer group to make a unified survival decision for the entire distributed database cluster when the cluster fails. The survival decision determines whether a node or a machine room is available. For example, in a same-city dual-active deployment, after one of the two rooms fails, it must be determined whether the other room survives: because a split-brain may occur between the two rooms, when machine room 1 cannot reach machine room 2 it is necessary to confirm whether machine room 1 is itself alive and whether machine room 2 has actually failed.
The apparatus 400 includes a unit 404 for voting, based on the external observer group, among the surviving nodes (i.e., candidates) in the distributed database cluster and electing a master node (i.e., leader), with the remaining surviving nodes becoming standby nodes (i.e., followers).
The apparatus 400 includes a unit 406 for copying the data of the master node to the standby nodes for synchronization. Optionally, the unit 406 may include a unit for implementing data synchronization from the primary node to the standby nodes based on a semi-synchronous replication protocol.
Fig. 5 illustrates an electronic device 500 on which the methods described herein may be implemented. The electronic device 500 shown in Fig. 5 may be implemented in software, hardware, or a combination of software and hardware.
As shown in Fig. 5, an electronic device 500 may include a processor 501 and a memory 502, where the memory 502 stores executable instructions that, when executed, cause the processor 501 to perform the methods described herein.
Embodiments of the present application also provide a computer-readable storage medium having stored thereon a computer program which, when executed, causes a computer to perform the methods described herein.
Embodiments of the present application also provide a computer program product comprising computer instructions which, when executed, cause a computer to perform the methods described herein.
It should be understood that all operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or the order of such operations, but rather should cover all other equivalent variations under the same or similar concepts.
The processor has been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and the overall design constraints imposed on the system. As an example, a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, microcontroller, digital signal processor (DSP), field-programmable gate array (FPGA), programmable logic device (PLD), state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described in this disclosure. The functions of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as software executed by a microprocessor, microcontroller, DSP, or other suitable platform.
The previous description is provided to enable any person skilled in the art to practice the various aspects of the application described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean "one and only one" unless specifically so stated, but rather "one or more". The term "some" means one or more unless specifically stated otherwise. All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims.
It will be appreciated by persons skilled in the art that various modifications and variations may be made to the above disclosed embodiments without departing from the spirit of the application; such modifications and variations are to be considered within the scope of the application, which is defined by the appended claims.
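Claims 5 to 7 that follow describe committing a replicated entry in a single round by omitting the second phase of the AppendEntries RPC, the standby committing locally as soon as the append succeeds. The following is a minimal in-memory sketch of that difference, not part of the patent text; the `Master`/`Standby` classes and their fields are hypothetical names chosen for illustration.

```python
class Standby:
    def __init__(self):
        self.log = []
        self.committed = 0  # number of committed entries

    def append(self, entry, one_phase: bool) -> bool:
        # Phase one of AppendEntries: persist the entry locally.
        self.log.append(entry)
        if one_phase:
            # Claims 6-7: skip phase two and commit locally right away.
            self.committed = len(self.log)
        return True  # ack back to the master

    def notify_commit(self, index: int):
        # Phase two of the standard protocol: master advertises the commit index.
        self.committed = index


class Master:
    def __init__(self, standbys, one_phase=False):
        self.standbys = standbys
        self.one_phase = one_phase
        self.log = []
        self.committed = 0
        self.rounds = 0  # message rounds used per replicated entry

    def replicate(self, entry):
        self.log.append(entry)
        self.rounds += 1
        for s in self.standbys:
            s.append(entry, self.one_phase)  # each standby acks the append
        # Claim 5: the master completes the commit after it and the
        # standbys have responded to the AppendEntries request.
        self.committed = len(self.log)
        if not self.one_phase:
            # Standard two-phase flow needs a second round to propagate
            # the commit index to the standbys.
            self.rounds += 1
            for s in self.standbys:
                s.notify_commit(self.committed)
```

In this toy model the two-phase flow costs two message rounds per entry while the one-phase variant costs one, with both ending in the same committed state on the standby.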

Claims (19)

1. A method for synchronizing nodes of a distributed database cluster based on an external observer group, comprising:

when the distributed database cluster fails, the external observer group performing a unified survival decision on the entire distributed database cluster;

based on the external observer group, voting on the surviving nodes in the distributed database cluster to elect a master node, the remaining surviving nodes being standby nodes, wherein the external observer group participates in the voting when half of the nodes in the distributed database cluster fail, and wherein the external observer group drives the surviving nodes to vote by checking the most recent master-standby replication relationship between the surviving nodes before the failure occurred; if the master-standby replication relationships are equal, the surviving nodes all vote for the surviving node whose Raft log progress is the most up to date; if the Raft log progress of all surviving nodes is up to date, one surviving node is selected as the master node by a random mechanism; and

copying the data of the master node to the standby nodes for synchronization.

2. The method according to claim 1, wherein the survival decision comprises: judging whether a machine room or a node in the distributed database cluster is available.

3. The method according to claim 1, wherein the data synchronization from the master node to the standby nodes is implemented based on a semi-synchronous replication protocol.

4. The method according to claim 1, wherein the data synchronization from the master node to the standby nodes is implemented based on an AppendEntries remote procedure call (RPC).

5. The method according to claim 4, further comprising: the master node completing a commit after waiting for the master node and the standby nodes to respond to the AppendEntries RPC request.

6. The method according to claim 4, further comprising: optimizing the two-phase process of the AppendEntries RPC into a one-phase process.

7. The method according to claim 6, wherein the optimizing comprises: omitting the second of the two phases of the AppendEntries RPC, and performing a local commit on the standby nodes directly when the first phase completes.

8. The method according to any one of claims 1 to 7, comprising: if the master node fails to reach a consistent change with the standby nodes, rolling back through a data flashback operation.

9. An apparatus for synchronizing nodes of a distributed database cluster based on an external observer group, comprising:

a unit for the external observer group to perform a unified survival decision on the entire distributed database cluster when the distributed database cluster fails;

a unit for voting, based on the external observer group, on the surviving nodes in the distributed database cluster to elect a master node, the remaining surviving nodes being standby nodes, wherein the external observer group participates in the voting when half of the nodes in the distributed database cluster fail, and wherein the external observer group drives the surviving nodes to vote by checking the most recent master-standby replication relationship between the surviving nodes before the failure occurred; if the master-standby replication relationships are equal, the surviving nodes all vote for the surviving node whose Raft log progress is the most up to date; if the Raft log progress of all surviving nodes is up to date, one surviving node is selected as the master node by a random mechanism; and

a unit for copying the data of the master node to the standby nodes for synchronization.

10. The apparatus according to claim 9, wherein the survival decision comprises: judging whether a machine room or a node in the distributed database cluster is available.

11. The apparatus according to claim 9, wherein the data synchronization from the master node to the standby nodes is implemented based on a semi-synchronous replication protocol.

12. The apparatus according to claim 9, wherein the data synchronization from the master node to the standby nodes is implemented based on an AppendEntries remote procedure call (RPC).

13. The apparatus according to claim 12, further comprising: the master node completing a commit after waiting for the master node and the standby nodes to respond to the AppendEntries RPC request.

14. The apparatus according to claim 12, further comprising: a unit for optimizing the two-phase process of the AppendEntries RPC into a one-phase process.

15. The apparatus according to claim 14, wherein the unit for optimizing comprises: a unit for omitting the second of the two phases of the AppendEntries RPC and performing a local commit on the standby nodes directly when the first phase completes.

16. The apparatus according to any one of claims 9 to 15, comprising: a unit for rolling back through a data flashback operation if the master node fails to reach a consistent change with the standby nodes.

17. An electronic device, comprising:

a memory for storing executable instructions; and

a processor, coupled to the memory, configured to cause the electronic device to perform the method according to any one of claims 1 to 8 when executing the executable instructions.

18. A computer-readable storage medium having a computer program stored thereon, the computer program, when executed by a computer, performing the method according to any one of claims 1 to 8.

19. A computer program product comprising computer instructions which, when executed by a computer, perform the method according to any one of claims 1 to 8.
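The election in claim 1 resolves ties in three ordered steps driven by the external observer group: the most recent master-standby replication relationship first, then Raft log progress, then a random pick. The sketch below (not part of the patent text) illustrates only that tie-break ordering; a real election would additionally require a majority of votes. The `Node` fields and the `elect_master` helper are hypothetical names.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    was_master: bool      # held the master role in the most recent
                          # master-standby relationship before the failure
    raft_log_index: int   # progress of this node's Raft log


def elect_master(survivors: list[Node],
                 rng: random.Random = random.Random()) -> Node:
    """Pick a master among surviving nodes per the claim 1 tie-break."""
    # Step 1: prefer the node that was master in the most recent
    # master-standby replication relationship before the failure.
    former_masters = [n for n in survivors if n.was_master]
    if len(former_masters) == 1:
        return former_masters[0]
    # Step 2: relationships are equal, so all survivors vote for the
    # node with the most up-to-date Raft log progress.
    best = max(n.raft_log_index for n in survivors)
    freshest = [n for n in survivors if n.raft_log_index == best]
    if len(freshest) == 1:
        return freshest[0]
    # Step 3: all logs equally fresh, so choose by a random mechanism.
    return rng.choice(freshest)
```

For example, a survivor that was the pre-failure master wins step 1 outright; among former standbys, the longest Raft log wins step 2.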
CN202110467944.XA | Priority 2021-04-28 | Filed 2021-04-28 | Method and device for synchronizing distributed database nodes based on external observer group | Active | CN113127565B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110467944.XA | 2021-04-28 | 2021-04-28 | CN113127565B (en) Method and device for synchronizing distributed database nodes based on external observer group

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110467944.XA | 2021-04-28 | 2021-04-28 | CN113127565B (en) Method and device for synchronizing distributed database nodes based on external observer group

Publications (2)

Publication Number | Publication Date
CN113127565A (en) | 2021-07-16
CN113127565B (en) | 2025-06-03

Family

ID=76781030

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202110467944.XA (Active) | CN113127565B (en) Method and device for synchronizing distributed database nodes based on external observer group | 2021-04-28 | 2021-04-28

Country Status (1)

Country | Link
CN | CN113127565B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113961349B (en)* | 2021-10-28 | 2022-09-06 | China Xi'an Satellite Control Center | QPID cluster control method and system
CN114500525B (en)* | 2021-12-24 | 2024-04-26 | Tianyi Cloud Technology Co., Ltd. | Method, device, computer equipment and medium for updating nodes in distributed system
CN114726856A (en)* | 2022-02-28 | 2022-07-08 | Chongqing Advanced Blockchain Research Institute | Self-adaptive master selection method based on Raft
CN114448996B (en)* | 2022-03-08 | 2022-11-11 | Nanjing University | Consensus method and system based on redundant storage resources under the framework of separation of computing and storage
CN114513525B (en)* | 2022-04-19 | 2022-07-05 | Beijing Esgyn Information Technology Co., Ltd. | Data consistency optimization method and system adopting cross-machine-room chain forwarding
CN114844891B (en)* | 2022-04-21 | 2024-04-12 | Inspur Cloud Information Technology Co., Ltd. | Block chain consensus method and system based on Raft algorithm
CN116599841B (en)* | 2023-07-18 | 2023-10-13 | China Unicom WO Music and Culture Co., Ltd. | Large-scale cloud storage system capacity expansion method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN108616566A (en)* | 2018-03-14 | 2018-10-02 | Huawei Technologies Co., Ltd. | Raft distributed systems select main method, relevant device and system
CN112073250A (en)* | 2020-09-17 | 2020-12-11 | New H3C Security Technologies Co., Ltd. | Controller cluster fault processing method and device, controller and controller cluster

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN104933132B (en)* | 2015-06-12 | 2019-11-19 | Shenzhen SequoiaDB Software Co., Ltd. | Distributed data base based on the sequence of operation number has the right to weigh electoral machinery
US10496669B2 (en)* | 2015-07-02 | 2019-12-03 | Mongodb, Inc. | System and method for augmenting consensus election in a distributed database
EP3553669B1 (en)* | 2016-12-30 | 2024-09-25 | Huawei Technologies Co., Ltd. | Failure recovery method and device, and system
CN108063787A (en)* | 2017-06-26 | 2018-05-22 | Hangzhou Woqu Technology Co., Ltd. | The method that dual-active framework is realized based on distributed consensus state machine
US10379966B2 (en)* | 2017-11-15 | 2019-08-13 | Zscaler, Inc. | Systems and methods for service replication, validation, and recovery in cloud-based systems
CN111338767B (en)* | 2018-12-18 | 2023-09-29 | Wuxi Yazuo Online Technology Co., Ltd. | PostgreSQL master-slave database automatic switching system and method
CN110635941A (en)* | 2019-08-30 | 2019-12-31 | Suzhou Inspur Intelligent Technology Co., Ltd. | Database node cluster fault migration method and device
CN112395640B (en)* | 2020-11-16 | 2022-08-26 | Information and Communication Branch of State Grid Hebei Electric Power Co., Ltd. | Industry internet of things data light-weight credible sharing technology based on block chain


Also Published As

Publication number | Publication date
CN113127565A (en) | 2021-07-16

Similar Documents

Publication | Title
CN113127565B (en) | Method and device for synchronizing distributed database nodes based on external observer group
CN109729129B (en) | Configuration modification method of storage cluster system, storage cluster and computer system
CN107832138B (en) | Method for realizing flattened high-availability namenode model
JP5559821B2 (en) | Method for storing data, method for mirroring data, machine-readable medium carrying an instruction sequence, and program for causing a computer to execute the method
CN106062717B (en) | A distributed storage replication system and method
CN102142008B (en) | Method and system for implementing distributed memory database, token controller and memory database
US7779295B1 (en) | Method and apparatus for creating and using persistent images of distributed shared memory segments and in-memory checkpoints
US20150347250A1 (en) | Database management system for providing partial re-synchronization and partial re-synchronization method of using the same
US9396076B2 (en) | Centralized version control system having high availability
CN115794499B (en) | Method and system for dual-activity replication data among distributed block storage clusters
CN110825763B (en) | MySQL database high-availability system based on shared storage and high-availability method thereof
JP5292351B2 (en) | Message queue management system, lock server, message queue management method, and message queue management program
WO2018068661A1 (en) | Paxos protocol-based methods and apparatuses for online capacity expansion and reduction of distributed consistency system
JP4461147B2 (en) | Cluster database using remote data mirroring
CN110830582B (en) | Cluster owner selection method and device based on server
CN113905054A (en) | Kudu cluster data synchronization method, device and system based on RDMA
CN114363350A (en) | Service management system and method
CN109726211A (en) | A kind of distribution time series database
CN116055563A (en) | Task scheduling method, system, electronic equipment and medium based on Raft protocol
WO2025119102A1 (en) | Two-level quorum method for effectively solving split brain of dual server rooms
CN105323271A (en) | Cloud computing system, and processing method and apparatus thereof
Yang et al. | Multi-active multi-datacenter distributed database architecture design based-on secondary development zookeeper
CN113518984A (en) | Database update
CN113138879A (en) | Method and system for hybrid edge replication
CN120045626B (en) | Distributed database system disaster recovery method and system based on cross-cluster data synchronization

Legal Events

Date | Code | Title | Description
 | PB01 | Publication |
 | SE01 | Entry into force of request for substantive examination |
 | GR01 | Patent grant |
