US10713275B2 - System and method for augmenting consensus election in a distributed database - Google Patents

System and method for augmenting consensus election in a distributed database

Info

Publication number
US10713275B2
US10713275B2
Authority
US
United States
Prior art keywords
node
arbiter
primary
data
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US15/200,975
Other versions
US20170032007A1 (en)
Inventor
Dwight Merriman
Eliot Horowitz
Andrew Michalski
Therese Avitabile
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
MongoDB Inc
Original Assignee
MongoDB Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by MongoDB Inc
Priority to US15/200,975
Publication of US20170032007A1
Assigned to MongoDB, Inc. Assignors: Dwight Merriman; Eliot Horowitz; Therese Avitabile; Andrew Michalski Schwerin
Application granted
Publication of US10713275B2
Status: Active
Adjusted expiration

Abstract

According to one aspect, a distributed database system is configured to manage write operations received from database clients and execute the write operations at primary nodes. The system then replicates received operations across a plurality of secondary nodes. Write operations can include safe write requests, such that the database guarantees the operation against data loss once acknowledged. In some embodiments, the system incorporates an enhanced arbiter role that enables the arbiter to participate in cluster-wide commitment of data. In other embodiments, the enhanced arbiter role enables secondary nodes to evaluate arbiter operation logs when determining election criteria for new primary nodes.

Description

RELATED APPLICATIONS
This application claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Application Ser. No. 62/188,097 entitled “SYSTEM AND METHOD FOR AUGMENTING CONSENSUS ELECTION IN A DISTRIBUTED DATABASE,” filed on Jul. 2, 2015, which application is incorporated herein by reference in its entirety.
BACKGROUND
Conventional distributed databases have become sophisticated in their architectures, and many have the ability to be customized to a variety of performance guarantees. Distributed databases executing eventual consistency models can be specifically tailored to achieve any number of durability guarantees. For instance, some distributed databases can be customized to target high durability of data writes, or to target low latency of writes, among other examples.
In some conventional systems, different architectures can be implemented to support automatic failover of master or primary nodes (i.e., nodes responsible for processing write operations) without loss of any committed data. One example approach is described in U.S. Pat. No. 8,572,031 incorporated by reference herein. Under that approach, primary nodes are elected to handle operations received from database clients, and replicate the operations to secondary nodes. To facilitate assignment of a primary node, a distributed database may include arbiter nodes that facilitate a consensus-based election of primary nodes between the primary and secondary nodes of the distributed database. As described in U.S. Pat. No. 8,572,031, the arbiter nodes can be configured to operate as observers of primary election, and further to participate in a primary election process to the extent of confirming a vote for a primary or a secondary node within the distributed database.
SUMMARY
Under conventional configurations of eventually consistent distributed databases, one or more sets of nodes can be responsible for receiving, executing, and replicating write operations throughout the distributed database. In such databases, various roles can be assigned to each node. For example, primary nodes can receive and execute write operations from database clients, and secondary nodes replicate the data hosted on the primary, replicate execution of the write operations on the primary, and elect primary nodes as necessary. In further examples, a distributed database can include arbiter nodes, which do not host data, but are configured to participate in elections of new primaries when they occur. Generally, conventional arbiter nodes do not participate in commitment operations for data within the distributed database.
According to one aspect, it is realized that conventional observer or arbiter nodes can be enhanced to facilitate cluster-wide commitment operations for data. For example, where sets of nodes are responsible for receiving, executing, and replicating write operations throughout the distributed database, arbiter nodes can be configured to maintain copies of the replicated operation log. In one example, by implementing the copy of the operation log on the arbiter node, the number of nodes that can respond to a consensus-based commit is increased. In some embodiments, arbiter operations can reduce database latency for data commitment. In other embodiments, the reduction in latency can be achieved with minimal overhead, for example, by implementing copies of the operation log on arbiter nodes within the distributed database.
In other embodiments, implementing enhanced arbiter nodes reduces the need for expensive hardware. For example, an enhanced arbiter node can replace a secondary node in a replica set. The hardware requirements for an enhanced arbiter are significantly lower than the hardware required for a true secondary node (e.g., by virtue of the secondary hosting database data). Thus, the system requirements, computational complexity, and expense of implementing a distributed database supported by replica sets are reduced. In one example, the enhanced arbiters are configured to support the ordinary functions of the replica set (e.g., replication, data commitment, and primary election), and do so without the need to host additional database copies. In other aspects, enhanced arbiters enable a minimal replica set environment of one primary, one secondary, and one arbiter while still implementing majority-based protocols (e.g., primary election and data commitment, among other options).
In further aspects, the arbiter role is defined such that the node having the arbiter role can also be configured to enhance a consensus-based election of a primary node. For example, consensus election can be based on the freshness of a node's data. For instance, the distributed database can be configured to promote a secondary node having the freshest data in response to failure of a primary node. According to one embodiment, the arbiter's operation log can be used to make an out of date secondary node electable.
In one implementation, the arbiter's operation log represents a window of operations that can be replayed to any secondary node. In some examples, the election process can be configured to take into account the ability to use the arbiter operation log to bring any secondary node as up to date as the last operation in the arbiter operation log. Thus, the operations from the arbiter copy of the operation log can bring the out of date secondary node current, and make any out of date secondary node a viable candidate for primary election.
According to one embodiment, a node can be defined in a distributed database with an arbiter or observer role that facilitates commitment operations and can also improve execution of primary election functions. As discussed, in conventional implementations arbiter or observer nodes do not participate in data commitment and serve a limited role in electing primary nodes. One example of a commitment requirement includes a replication rule that more than half of the nodes responsible for a given portion of data must receive the replication operation prior to commitment. Once more than half of the nodes responsible for the data have received the replicated operation, the durability of the operation is assured. Enhanced arbiters, which can include copies of the replicated operation log, can participate in cluster-wide commitment like secondary nodes, even in examples where the arbiter node does not host a copy of the distributed data. Upon receiving a replicated operation entry from a primary node, the arbiter can report receipt, and any count towards the majority requirement is increased accordingly. In some embodiments, implementing enhanced arbiter roles can decrease the time required for commitment of write operations, reducing latency and improving the operation of the database system. Further, enhancing arbiter roles in a distributed database improves durability of operations with little overhead and thus increases the efficiency of the database system.
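To make the majority rule concrete, the following Python sketch (illustrative only, not the patented implementation; the Ack structure and node names are assumptions) counts an arbiter's receipt of a log entry toward the commit majority exactly as it would a secondary's execution:

```python
from dataclasses import dataclass

@dataclass
class Ack:
    node: str      # node identifier
    role: str      # "primary", "secondary", or "arbiter"
    logged: bool   # True once the op is executed or entered in the oplog

def is_committed(acks, responsible_nodes):
    """Commit once more than half of the nodes responsible for the
    data have the operation. Enhanced arbiters count toward this
    majority by logging the entry, even though they host no data."""
    received = sum(1 for a in acks if a.logged)
    return received > responsible_nodes // 2

# Minimal replica set (one primary, one secondary, one arbiter): the
# primary's own write plus the arbiter's log receipt form a 2-of-3
# majority before the secondary has even replied.
acks = [Ack("p0", "primary", True), Ack("arb0", "arbiter", True)]
print(is_committed(acks, responsible_nodes=3))  # True
```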
According to one aspect, a computer implemented method for managing a distributed database is provided. The method comprises establishing at least one primary node within a plurality of nodes, wherein the plurality of nodes comprise the distributed database system and the distributed database system provides responses to database requests from database clients, restricting processing of write operations received from the database clients to the at least one primary node, establishing at least one secondary node configured to host a replica of the primary node database from the plurality of nodes, establishing at least one arbiter node configured to host an operation log of operations executed by the primary node, replicating from the primary node at least one log entry reflecting the write operations executed by the primary node to the at least one secondary node and the at least one arbiter node, and confirming a safe write operation received from a database client in response to determining that the safe write operation has been executed at a threshold number of the plurality of nodes responsible for the associated data, wherein determining that the safe write operation has been executed at the threshold number includes an act of: determining from the nodes responsible for the associated data including the at least one arbiter and the at least one secondary node, that the threshold number of nodes has entered the log entry reflecting the write operation into a respective operation log or has executed the write operation on a respective replica of the primary node database.
According to one embodiment, the threshold number of the plurality of nodes is determined by reaching a majority of the number of nodes making up the at least one primary, the at least one secondary node, and the at least one arbiter node. According to one embodiment, the safe write operation includes any one or more members of a group of a data modification request, an update request, a data creation request, and a data deletion request.
According to one embodiment, the act of establishing the at least one arbiter node includes defining an arbiter role restricting arbiter nodes from servicing database client requests (e.g., read requests). According to one embodiment, the act of establishing the at least one arbiter node includes defining an arbiter role restricting arbiter nodes from hosting a replica of the primary node database. According to one embodiment, the method further comprises identifying missing operations by at least one of the secondary nodes. According to one embodiment, the method further comprises querying, by the at least one of the secondary nodes, the at least one arbiter node to determine whether the missing operations are available from the arbiter's operation log.
According to one embodiment, the method further comprises communicating at least one log entry associated with the missing operation from the arbiter's operation log to the at least one secondary node, and executing the missing operation at the at least one secondary node. According to one embodiment, the method further comprises an act of electing a new primary node from the plurality of secondary nodes, wherein the plurality of secondary nodes are configured to analyze the operation log of the at least one arbiter to determine eligibility to be elected the new primary node. According to one embodiment, the act of replicating occurs asynchronously as a default mode, and wherein the act of replicating is confirmed responsive to a safe write request.
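The catch-up path sketched in the embodiments above (querying the arbiter for missing operations and replaying them at the secondary) might look roughly like this; the timestamped-log layout and the apply callback are assumptions for illustration:

```python
def catch_up_from_arbiter(last_applied, arbiter_oplog, apply):
    """Replay operations a secondary is missing from the arbiter's
    rolling operation log.

    last_applied:  timestamp of the secondary's last applied operation
    arbiter_oplog: list of (ts, op) pairs, sorted by ts, representing
                   the arbiter's retained window of operations
    apply:         callback executing one op on the secondary's replica
    Returns True if the secondary is now current with the window,
    False if the gap predates the window (a full resync is needed).
    """
    if arbiter_oplog and last_applied < arbiter_oplog[0][0]:
        return False  # window no longer reaches back far enough
    for ts, op in arbiter_oplog:
        if ts > last_applied:
            apply(op)          # execute the missing operation
            last_applied = ts
    return True
```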
According to one aspect, a distributed database system is provided. The system comprises at least one processor operatively connected to a memory, the at least one processor when running is configured to execute a plurality of system components, wherein the plurality of system components comprise: a configuration component configured to establish a role associated with each node in a plurality of nodes, wherein the role component is configured to establish at least one primary node with a primary role, at least a plurality of secondary nodes with a secondary role, and at least one arbiter node with an arbiter role, a replication component configured to restrict write operations received from client computer systems to the at least one primary node having the primary role, the at least one primary node configured to: execute write operations on a respective copy of at least a portion of the database data and generate a log entry for execution of the operation, replicate the log entry to the plurality of secondary nodes and the at least one arbiter node, the plurality of secondary nodes configured to: host a copy of data hosted by the at least one primary node, execute the log entry received from the primary node to update a respective copy of the data, the at least one arbiter node configured to update an operation log of operations performed by the at least one primary node, wherein the replication component is further configured to acknowledge a safe write operation responsive to determining that operation has been executed, replicated, or logged by a threshold number of the nodes responsible for the data associated with the safe write operation.
According to one embodiment, the threshold number of the plurality of nodes is determined upon reaching a majority of the number of nodes making up the at least one primary, the at least one secondary node, and the at least one arbiter node responsible for target data of the safe write operation. According to one embodiment, the safe write operation includes any one or more members of a group including a data modification request, an update request, a data creation request, and a data deletion request. According to one embodiment, the configuration component is further configured to establish the arbiter role to restrict arbiter nodes from servicing database client requests (e.g., read requests). According to one embodiment, the configuration component is further configured to establish the arbiter role to restrict arbiter nodes from hosting a replica of the primary node database.
According to one embodiment, the replication component is further configured to identify missing operations on at least one of the secondary nodes responsive to receipt of the at least one log entry reflecting the write operations executed by the primary node. According to one embodiment, the at least one of the secondary nodes is configured to execute missing operations received from an arbiter operation log. According to one embodiment, the replication component is further configured to query the at least one arbiter node to determine that the missing operations are available from the arbiter's operation log. According to one embodiment, the replication component is further configured to trigger communication of the at least one log entry associated with the missing operation from the arbiter's operation log to the at least one secondary node.
According to one aspect, a computer implemented method for managing a distributed database is provided. The method comprises establishing at least one primary node within a plurality of nodes, wherein the plurality of nodes comprise the distributed database system and the distributed database system provides responses to database requests from database clients, restricting processing of write operations received from the database clients to the at least one primary node, establishing at least one secondary node configured to host a replica of the primary node database from the plurality of nodes and update the replica responsive to received replicated operations from the at least one primary node, establishing at least one arbiter node configured to host an operation log of operations executed by the primary node, and electing a new primary responsive to detecting a failure of the at least one primary node, wherein electing the new primary node includes: executing a consensus protocol between the at least one secondary node and the at least one arbiter node associated with the failed at least one primary node, evaluating election criteria at the at least one secondary node during an election period, communicating by a respective one of the at least one secondary node a self-vote responsive to determining the respective one of the at least one secondary node meets the election criteria, or communicating a confirmation of a received vote of another secondary node responsive to determining the another secondary node has better election criteria, and evaluating by the respective one of the at least one secondary node the operation log of the at least one arbiter node to determine election criteria associated with most recent data as part of the determining the another secondary node has better election criteria or determining the respective one of the at least one secondary node meets the election criteria.
According to one embodiment, the method further comprises an act of communicating a self-vote responsive to querying the at least one arbiter's operation log and determining that the at least one arbiter's operation log includes operations as recent as or more recent than received election information.
According to one embodiment, the election criteria include at least one of, or any combination of, or two or three or more of: a most recent data requirement, a location requirement, a hardware requirement, and an uptime requirement. According to one embodiment, responsive to the respective one of the at least one secondary node determining the at least one arbiter's operation log includes the most recent data, the method includes triggering an update of the respective one of the at least one secondary node's replica of the primary database. According to one embodiment, the method further comprises an act of committing write operations responsive to safe write requests received from database clients, wherein committing the write operations includes counting the logging of the write operation at the at least one arbiter towards any commitment requirements.
According to one aspect, a distributed database system is provided. The system comprises at least one processor operatively connected to a memory, the at least one processor when running is configured to execute a plurality of system components, wherein the plurality of system components comprise: a configuration component configured to establish a role associated with each node in a plurality of nodes, wherein the role component is configured to establish at least one primary node with a primary role, at least one secondary node with a secondary role, and at least one arbiter node with an arbiter role, a replication component configured to restrict write operations received from client computer systems to the at least one primary node having the primary role, the at least one primary node configured to: execute write operations on a respective copy of at least a portion of the database data and generate a log entry for execution of the operation, replicate the log entry to the at least one secondary node and the at least one arbiter node, the at least one secondary node configured to: host a copy of data hosted by the at least one primary node, execute the log entry received from the primary node to update a respective copy of the data, the at least one arbiter node configured to update an operation log of operations performed by the at least one primary node, an election component configured to elect a new primary responsive to detecting a failure of the at least one primary node, wherein the election component is further configured to execute a consensus protocol between the at least one secondary node and the at least one arbiter associated with the failed at least one primary node, evaluate election criteria at the at least one secondary node during an election period, communicate a self-vote for a respective one of the at least one secondary node responsive to determining the respective one of the at least one secondary node meets the election criteria, or communicate a confirmation of a received vote of another secondary node responsive to determining the another secondary node has better election criteria; and evaluate the operation log of the at least one arbiter node to determine election criteria associated with most recent data as part of the determining the another secondary node has better election criteria or determining the respective one of the at least one secondary node meets the election criteria.
According to one embodiment, the election component is distributed through at least some of the nodes comprising the distributed database. According to one embodiment, the election component is distributed at least through the at least one secondary node and the at least one arbiter node.
According to one aspect, a computer implemented method for electing a primary node in a distributed database is provided. The method comprises electing a new primary responsive to detecting a failure of the at least one primary node, wherein electing the new primary node includes executing a consensus protocol between the at least one secondary node and the at least one arbiter associated with the failed at least one primary node, evaluating election criteria at the at least one secondary node during an election period, communicating by a respective one of the at least one secondary node a self-vote responsive to determining the respective one of the at least one secondary node meets the election criteria or communicating a confirmation of a received vote of another secondary node responsive to determining the another secondary node has better election criteria; and evaluating by the respective one of the at least one secondary node the operation log of the at least one arbiter node to determine election criteria associated with most recent data as part of the determining the another secondary node has better election criteria or determining the respective one of the at least one secondary node meets the election criteria. According to one embodiment, the method further comprises any one or more of the preceding methods or method elements.
According to one aspect, a distributed database system is provided. The system comprises at least one processor operatively connected to a memory, the at least one processor when running is configured to execute a plurality of system components, wherein the plurality of system components comprise: an election component configured to elect a new primary responsive to detecting a failure of the at least one primary node, wherein the election component is further configured to: execute a consensus protocol between the at least one secondary node and the at least one arbiter associated with the failed at least one primary node, evaluate election criteria at the at least one secondary node during an election period, communicate a self-vote for a respective one of the at least one secondary node responsive to determining the respective one of the at least one secondary node meets the election criteria or communicate a confirmation of a received vote of another secondary node responsive to determining the another secondary node has better election criteria, and evaluate the operation log of the at least one arbiter node to determine election criteria associated with most recent data as part of the determining the another secondary node has better election criteria or determining the respective one of the at least one secondary node meets the election criteria. According to one embodiment, the system further comprises any one or more or any combination of the preceding system embodiments or system elements.
According to one aspect, a distributed database system is provided. The system comprises at least one primary node configured to execute write operations from database clients and replicate a log entry associated with the write operation to at least one secondary node and at least one arbiter node, at least one processor operatively connected to a memory, the at least one processor when running is configured to execute a plurality of system components, wherein the plurality of system components comprise a replication component configured to restrict write operations received from the database client to the at least one primary node having a primary role, the at least one arbiter node configured to update an operation log of operations performed by the at least one primary node, wherein the replication component is further configured to acknowledge a safe write operation responsive to determining that operation has been executed, replicated, or logged by a threshold number of the nodes from the at least one arbiter node and the at least one secondary node responsible for the data associated with the safe write operation. According to one embodiment, the system further comprises any one or more or any combination of system embodiments or system elements.
According to one aspect, a computer implemented method for managing a distributed database is provided. The method comprises replicating from a primary node at least one log entry reflecting write operations executed by the primary node to at least one secondary node and at least one arbiter node; and confirming a safe write operation received from a database client in response to determining that the safe write operation has been executed at a threshold number of the plurality of nodes hosting the associated data or an operation log, wherein determining that the safe write operation has been executed at the threshold number includes: determining that operation has been executed, replicated, or logged by a threshold number of the nodes of the at least one arbiter node and the at least one secondary node responsible for the data associated with the safe write operation.
According to one embodiment, the method includes establishing a primary node configured to host a copy of database data, execute write operations, maintain an operation log, and replicate executed operations to secondary nodes and arbiter nodes. According to one embodiment, the method includes establishing a secondary node configured to host a copy of database data, execute replicated write operations, and maintain an operation log. According to one embodiment, the method includes establishing an arbiter node configured to maintain an operation log based on received replication operations from the primary. In another embodiment, the arbiter node does not host a client accessible copy of the database data.
According to one embodiment, determining that operation has been executed, replicated, or logged includes acts of: determining that the at least one arbiter node has entered the log entry reflecting the write operation into a respective operation log, and determining that at least one of the at least one secondary node has executed the write operation on a respective replica of the primary node database or entered the log entry reflecting the write operation into a respective operation log. According to one embodiment, the method further comprises any one or more or any combination of the preceding methods or individual method elements from any one or more of the preceding methods.
Still other aspects, embodiments, and advantages of these exemplary aspects and embodiments, are discussed in detail below. Any embodiment disclosed herein may be combined with any other embodiment in any manner consistent with at least one of the objects, aims, and needs disclosed herein, and references to “an embodiment,” “some embodiments,” “an alternate embodiment,” “various embodiments,” “one embodiment” or the like are not necessarily mutually exclusive and are intended to indicate that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment. The appearances of such terms herein are not necessarily all referring to the same embodiment. The accompanying drawings are included to provide illustration and a further understanding of the various aspects and embodiments, and are incorporated in and constitute a part of this specification. The drawings, together with the remainder of the specification, serve to explain principles and operations of the described and claimed aspects and embodiments.
BRIEF DESCRIPTION OF THE DRAWINGS
Various aspects of at least one embodiment are discussed herein with reference to the accompanying figures, which are not intended to be drawn to scale. The figures are included to provide illustration and a further understanding of the various aspects and embodiments, and are incorporated in and constitute a part of this specification, but are not intended as a definition of the limits of the invention. Where technical features in the figures, detailed description or any claim are followed by reference signs, the reference signs have been included for the sole purpose of increasing the intelligibility of the figures, detailed description, and/or claims. Accordingly, neither the reference signs nor their absence are intended to have any limiting effect on the scope of any claim elements. In the figures, each identical or nearly identical component that is illustrated in various figures is represented by a like numeral. For purposes of clarity, not every component may be labeled in every figure. In the figures:
FIG. 1 illustrates a block diagram of an example architecture for a distributed database system, according to one embodiment;
FIG. 2 illustrates a block diagram of an example architecture for a distributed database architected with database shards, according to one embodiment;
FIG. 3 illustrates a block diagram of an example distributed database system, according to one embodiment;
FIG. 4 illustrates an example process flow for committing operations with arbiter participation, according to one embodiment;
FIG. 5 illustrates an example process flow for consensus election with arbiter participation, according to one embodiment;
FIG. 6 illustrates an example process flow for determining election information, according to one embodiment;
FIGS. 7A-D illustrate example pseudo-code that can be executed by the system, replication engine, and/or any system component in whole or in part, according to some embodiments; and
FIG. 8 is a block diagram of an example special purpose computer system specially configured to host one or more elements of a distributed database system on which various aspects of the present invention can be practiced.
DETAILED DESCRIPTION
According to one aspect, a distributed database system can be enhanced via operation of arbiter nodes. Unlike conventional systems, the arbiter nodes can be modified to participate in data commitment operations, resulting in reduced latency and increased durability of write operations. According to one embodiment, the arbiter role can be defined for the distributed database such that the arbiter node does not operate as and cannot be elected to a primary role, but can augment data commitment and/or consensus election protocols.
According to one aspect, a primary node with the primary role handles write operations received from database clients and distributes write operations executed on the primary node's copy of the data to secondary nodes hosting respective copies of the data. The secondary nodes receive logged operations from the primary and execute the logged operations on their respective copies of the distributed data. Under the enhanced arbiter role, arbiter nodes in the distributed database also receive logged operations from the primary node. According to one embodiment, the arbiter nodes maintain logged operations for a period of time, for example, specified by database configuration files. As the period of time is exceeded, the oldest logged operations can be deleted, archived, or imaged, and new operations added to the arbiter's operation log. In some examples, the arbiter's operation log is configured to maintain a rolling window of operations that can be accessed by the system and executed to update any secondary node.
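A minimal sketch of such a rolling window, assuming a wall-clock retention period set by configuration (the entry format and defaults below are invented for illustration):

```python
from collections import deque
import time

class RollingOplog:
    """Arbiter-side operation log retaining only a recent window of
    operations; older entries are dropped (or could be archived) as
    new operations arrive."""

    def __init__(self, window_seconds=600):  # e.g., a ten-minute window
        self.window = window_seconds
        self.entries = deque()               # (wall_time, op), oldest first

    def append(self, op, now=None):
        now = time.time() if now is None else now
        self.entries.append((now, op))
        # Eject entries that have aged out of the configured window.
        while self.entries and now - self.entries[0][0] > self.window:
            self.entries.popleft()

log = RollingOplog(window_seconds=300)
log.append({"op": "insert", "ns": "db.coll", "doc": {"_id": 1}})
```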
According to one embodiment, primary nodes can be assigned at initiation of a distributed database, and in other embodiments can also be initially elected from a pool of secondary nodes. In some examples, a distributed database system defines an arbiter role such that the arbiter node cannot be elected as a primary node, but can participate in electing a primary node. In further examples, the arbiter's operation log can be used in evaluating which secondary nodes are electable to be primary. In one example, the availability of the arbiter's operation log can be evaluated by the distributed database (and, for example, secondary nodes), and used to determine if the secondary node as updated by the arbiter operation log would be the best candidate for election. In some embodiments, the arbiter role is defined on the system such that arbiters do not host client accessible copies of the database data (e.g., arbiters are not configured to respond to client requests for data), or are not configured to respond to client requests. In one example, arbiter nodes are specially configured to operate as observers of the database operations for client requests on data (e.g., logging operations only, facilitating consensus elections, etc.). In another example, the arbiter node role is configured so that an arbiter cannot operate as a secondary node and further cannot be elected as a primary node.
Examples of the methods and systems discussed herein are not limited in application to the details of construction and the arrangement of components set forth in the following description or illustrated in the accompanying drawings. The methods and systems are capable of implementation in other embodiments and of being practiced or of being carried out in various ways. Examples of specific implementations are provided herein for illustrative purposes only and are not intended to be limiting. In particular, acts, components, elements and features discussed in connection with any one or more examples are not intended to be excluded from a similar role in any other examples.
Also, the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. Any references to examples, embodiments, components, elements or acts of the systems and methods herein referred to in the singular may also embrace embodiments including a plurality, and any references in plural to any embodiment, component, element or act herein may also embrace embodiments including only a singularity. References in the singular or plural form are not intended to limit the presently disclosed systems or methods, their components, acts, or elements. The use herein of “including,” “comprising,” “having,” “containing,” “involving,” and variations thereof is meant to encompass the items listed thereafter and equivalents thereof as well as additional items. References to “or” may be construed as inclusive so that any terms described using “or” may indicate any of a single, more than one, and all of the described terms.
FIG. 1 illustrates a block diagram of an example architecture 100 for a distributed database system that provides at least eventual consistency in the database data via replication of write operations to other nodes hosting the same data. The distributed database system 100 is further configured to include one or more arbiter nodes having an arbiter role (e.g., 178, 180, and 182).
In some embodiments of the distributed database system, enhanced arbiter processing requires a configuration permitting or enabling arbiter participation in operation log execution. The operation log setting can be enabled for all arbiter nodes across an entire database, or can be enabled for groupings of arbiter nodes within the database (e.g., logical groupings of data, partitions of data, for specific nodes, groups of nodes, etc.). In further examples, arbiters can be configured individually to participate in capturing and storing operation log information. In various embodiments, once enabled, the participation of the arbiter nodes in data commitment and/or consensus election protocols improves the operation of conventional database systems, for example, through increased durability of database data, reduced latency in commitment operations, and increased resource utilization across the distributed database, among other examples.
According to one embodiment, an arbiter node can participate in safe write operations on the distributed database. Safe write operations can be requested by database clients such that once the write operation is acknowledged by the database system, the operation/data will not be lost. In some examples, the distributed database system is configured to acknowledge safe write operations only in response to a threshold number of systems executing the operation. With arbiter nodes participating by copying each operation to a respective operation log, the threshold number can be greater and can be reached faster, and resource utilization across the database increases, any of which improves the database system over conventional approaches. In further embodiments, the distributed database can include a default mode of operation whereby writes are replicated asynchronously and are not acknowledged, unless a safe write request is made to the database.
In one example, the distributed database 100 can be architected to implement database shards. Sharding refers to the process of separating the database into partitions and each partition can be referred to as a “chunk.” Conventional databases such as network-based, file-based, entity-based, relational, and object oriented databases, among other types, can be configured to operate within a sharded environment and each can benefit from the implementation of the enhanced arbiter role with or without sharding. In various embodiments, enhanced arbiter nodes can be implemented with any other types of databases, database organization, hybrid and/or database management systems, thereby reducing complexity in reconciliation and/or ordering operations, and thereby improving the performance of the database system.
As discussed, the distributed database system 100 can be specially configured as a sharded cluster including enhanced arbiter nodes. A sharded cluster is the grouping of shards that collectively represent the data within the database. A sharded cluster typically comprises multiple servers (e.g., 102-108) hosting multiple chunks (e.g., 152-174). Chunks establish non-overlapping partitions of data wherein a designated field known as a “shard key,” present in each datum composing the data, defines a range of values from some minimum value to some maximum value. The arbitrary collection of chunks residing on any given node at any given time is collectively referred to as a shard. In further embodiments, the distributed database can include a balancer configured to migrate chunks from shard to shard in response to data growth. According to one embodiment, the cluster can also include one or more configuration servers (e.g., 110-114) for metadata management, and operation routing processes (e.g., 116-118). Metadata for the sharded cluster can include, for example, information on the ranges of data stored in each partition, information associated with managing the sharded cluster, partition counts, number of shard servers, data index information, partition size constraints, and data distribution thresholds, among other options.
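As a toy illustration of the chunk model, a router can map a shard-key value to the chunk whose range contains it; the chunk table and shard names below are made up, not the system's actual metadata format:

```python
import bisect

# Non-overlapping chunk ranges over the shard key, as (lower_bound, shard);
# a value belongs to the last chunk whose lower bound does not exceed it.
chunks = [(float("-inf"), "shard102"), (0, "shard104"),
          (1000, "shard106"), (5000, "shard108")]

def route(shard_key_value):
    bounds = [lo for lo, _ in chunks]
    i = bisect.bisect_right(bounds, shard_key_value) - 1
    return chunks[i][1]

print(route(42))    # shard104
print(route(7500))  # shard108
```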
According to some embodiments, the metadata for the sharded clusters includes information on whether the database permits participation of arbiter nodes in data commitment and/or consensus election protocols. In other embodiments, arbiter participation can be specified at data handling nodes of the distributed database (e.g., primary or secondary nodes), and/or where configurations specify what nodes are responsible for hosting data. In some examples, the distributed database system is configured to accept and enforce configuration settings that enable or disable arbiter participation in data commitment and/or election protocols for a database, a group of database nodes, subsets of the database, logical groupings of the data within the database, etc. In one example, the setting can be stored as configuration metadata.
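The enable/disable setting could be carried in configuration metadata along the following lines; the key names and scoping levels are hypothetical, since the text does not prescribe a format:

```python
# Hypothetical configuration metadata: arbiter participation toggled
# database-wide, with per-group overrides for subsets of the data.
cluster_config = {
    "arbiterOplogEnabled": True,       # database-wide default
    "overrides": {
        "analyticsReplicaSet": False,  # disabled for one node grouping
    },
}

def arbiter_participation_enabled(config, node_group):
    return config["overrides"].get(node_group, config["arbiterOplogEnabled"])

print(arbiter_participation_enabled(cluster_config, "analyticsReplicaSet"))  # False
print(arbiter_participation_enabled(cluster_config, "ordersReplicaSet"))     # True
```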
Each chunk of data (e.g., 152-174) can be configured to reside on one or more servers executing database operations for storing, retrieving, managing, and/or updating data. Each chunk can be hosted as multiple copies of the data hosted on multiple systems. In one example, each chunk of data (e.g., 152-174) can be hosted by a replica set (e.g., a group of systems with copies of respective database data). In other embodiments, one replica set can host multiple chunks of data.
Configurations within a sharded cluster can be defined by metadata associated with the managed database and can be referred to as shard metadata. Shard metadata can include information on collections within a given database, the number of collections, data associated with accessing the collections, database key properties for a given collection, ranges of key values associated with a given partition and/or shard within a given collection, systems hosting copies of the same data, specification of nodes and roles (e.g., primary, secondary, arbiter, etc.) to provide some examples.
Shard metadata can also include information on whether arbiter participation in data commitment and/or election protocols is permitted. In some embodiments, the shard metadata can be managed dynamically to include information on a last write operation, processed or received. The information on the last write operation can be used in selecting a node in the database to handle subsequent or even simultaneous write operations to the database.
According to one embodiment, underlying a sharded cluster is a set of nodes that maintains multiple copies of the sharded data, for example, copies of the chunks of data. According to one aspect, the set of nodes can be configured as a replica set, as described in U.S. Pat. No. 8,572,031, incorporated herein by reference in its entirety. Each replica set is made up of a number of nodes including at least a primary node, which maintains a primary copy of the data, and a secondary node, which receives replicated operations from the primary. More generally, a node in the distributed database is any processing entity that is responsible for a portion of the database data or management functions associated with the database data. In one example, a node can include a database instance executing on a stand-alone server. In other examples, a node can host multiple database instances.
Various implementations of sharded databases may incorporate replica sets and are discussed with respect to co-pending U.S. Patent Application Publication 2012-0254175, incorporated herein by reference in its entirety. The sharded databases discussed may be modified to include enhanced arbiter configurations (e.g., arbiters configured to participate in data commitment, arbiters configured to participate in secondary updating for election, and various combinations thereof) and utilize the various aspects and embodiments discussed herein.
Returning to FIG. 1, the three dots illustrated next to the system components indicate that the system component can be repeated. In some embodiments, adding additional shards, configuration servers, copies of partitions, and/or shard routing processes can increase the capacity of the distributed database system. The operation router processes 116-118 handle incoming requests from clients 120 (e.g., applications, web services, user initiated requests, application protocol interfaces, etc.). The router processes 116-118 are configured to provide a transparent interface to handle database requests. In particular, client 120 need not know that a database request is being served by a sharded database. The operation router processes receive such client requests and route the database requests to the appropriate chunk(s), e.g., 152-174 on shards 102-108.
According to some embodiments, the operation router processes are configured to identify primary nodes for handling the write operation from a plurality of database instances (e.g., each node or instance capable of handling the write operation). In further embodiments, enhanced arbiter nodes can be implemented in a multi-writer distributed database, for example, as described in provisional patent application 62/180,232, filed on Jun. 16, 2015, incorporated by reference in its entirety and included as Appendix A. Once a write operation is received by a primary node hosting the data, the primary node executes the write operation specified and records the operation to an operation log. The operation can then be distributed throughout the database to the copies of the respective data.
According to some embodiments, the primary nodes implement an eventually consistent replication methodology whereby the operation is copied to any secondary nodes also responsible for that data. The secondary nodes execute the operation to bring the secondary nodes' data into agreement. In further embodiments, arbiter nodes are configured to receive and store each operation from the operation log distributed from the primary node. In some examples, the arbiter maintains a copy of the operations executed at the primary node. In further examples, the arbiter copy of the operation log can be limited in time to reduce overhead on the distributed database and minimize storage requirements. In some implementations, system configurations can specify five-minute or ten-minute windows for stored operations. The system configurations can specify a limited window of operations that are preserved in any arbiter copy. In one example, as the window is exceeded, the oldest operations can be ejected (or archived) from an arbiter copy of the log in response to the receipt of new operations.
As discussed, FIG. 1 illustrates one example architecture for a distributed database. In other embodiments, sharding is not implemented and the database data is managed entirely via replica sets. In yet others, the database can be a hybridization of sharded elements and replica set elements. Shown in FIG. 2 is an example replica set 200, having a primary node 202, a secondary node 204, and an arbiter node 206. As discussed, primary nodes in a replica set (e.g., 202) handle write operations on a primary database copy (e.g., 208) and replicate the write operations throughout the database by executing each write operation and creating a log entry (e.g., in operation log 210) for the changes to the database. The primary nodes replicate the log entry to secondary nodes (e.g., 204). Secondary nodes host copies of the data and update the data (e.g., 212) based on replicated operations received from the primary node. The secondary nodes maintain a local operation log (e.g., operation log 214). Enhanced arbiter nodes are configured to receive logged operations from the primary node as well. Instead of using the logged operation to execute changes on a local copy of the database, as would be the case with secondary nodes, the enhanced arbiter node saves the logged operation directly into a local operation log (e.g., at 216).
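The contrast between the two consumers of the primary's log entries can be sketched as role-specific handlers (class and field names are illustrative; the secondary both applies and logs, while the arbiter only logs):

```python
class Secondary:
    def __init__(self):
        self.data = {}    # replica of the primary's data (cf. 212)
        self.oplog = []   # local operation log (cf. 214)

    def on_replicated_entry(self, entry):
        # Execute the operation on the local replica, then record it.
        if entry["op"] == "set":
            self.data[entry["key"]] = entry["value"]
        elif entry["op"] == "delete":
            self.data.pop(entry["key"], None)
        self.oplog.append(entry)

class Arbiter:
    def __init__(self):
        self.oplog = []   # local operation log only (cf. 216)

    def on_replicated_entry(self, entry):
        # No database copy to update: save the entry directly to the log.
        self.oplog.append(entry)
```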
In some embodiments, the arbiter nodes do not host copies of the database data, but only maintain operations received from a primary node in an operation log. In one example, the arbiters (e.g., 206) are configured to maintain recent operations in respective operation logs (e.g., 216). Once the oldest operation in the log reaches a threshold and/or the operation log reaches a size boundary, the oldest operations can be archived or deleted, so that stale operations do not fill up the operation log. In some embodiments, the threshold and/or size can be set as database configurations. In one example, the system can be configured to permit a larger operation log on arbiter nodes. In some embodiments, the arbiter nodes have greater capacity and/or additional resources freed up by not maintaining copies of the database data.
In further embodiments, operation routing processes associated with a replica set can receive state information from the configuration processes specifying what roles have been assigned to each node and/or server in a distributed database. The router processes can use the state information to direct replication operations to the various servers in the cluster. Alternatively, each server can be provided information on other members of the clusters, and routing of replication operations can be handled, for example, by a primary node.
In response to a write or read request from a client, the request is communicated to a routing process to determine which database node is needed to respond with the appropriate data. In some embodiments, multiple routing processes can be implemented to route client requests to nodes having the appropriate data.
Routing processes (e.g., 116 or 118 of FIG. 1) can be configured to forward a write operation to the appropriate node hosting the requested data. According to other embodiments, and for example, in a distributed database comprising only replica sets, directing writes and reads from clients occurs via client driver configuration. In one example, the client drivers direct requests to nodes with an appropriate role in a replica set (including, for example, directing writes to only primaries). In a further example, the replica set members can be configured to return errors to protect against invalid requests from client drivers. In one instance, the replica set members are configured to return errors in response to write requests made against secondary nodes, protecting the replica set against invalid operations.
Once the appropriate node hosting the requested data is selected, the node captures any relevant data, performs any writes, and returns the results of the execution (if necessary). In some examples, and in particular for sharded implementations, the routing process can be configured to merge the results from multiple nodes, as necessary, and communicate the result to the client. In some examples, the routing process communicates with an application layer that manages communication with the client.
As discussed, arbiters (e.g., 206) can be configured to facilitate commitment of a write operation. In some embodiments, the distributed database is configured with a durability guarantee such that once operations have reached greater than half of the nodes responsible for the written data, no loss of the data will occur. According to one embodiment, the arbiters increase the number of nodes that can be counted towards the majority requirement. In some embodiments, a primary node or the router process can be configured to execute a data query operation to confirm receipt of a replicated operation at a secondary node. Once a majority of secondary nodes have reported receipt, the operation is deemed committed. Arbiters can be configured to enhance this process by, for example, reporting when the arbiter has received a write operation and/or stored the write operation to its copy of the operation log. According to one embodiment, the distributed database system is configured to treat the arbiter having a copy of the operation log as a secondary for the purposes of distributed commitment protocols. Thus, as shown in FIG. 2, once a replicated operation reaches either the secondary 204 or the arbiter 206, the operation is deemed committed and will not be lost upon failures of the servers making up the cluster. FIG. 2 illustrates a minimal implementation of a replica set (e.g., one primary, one secondary, and one arbiter). In other implementations, larger numbers of secondary nodes can be available, as well as larger numbers of arbiters. Given nine nodes in a replica set comprised of one primary, secondary nodes, and enhanced arbiter nodes, a write operation will be deemed committed upon reaching or being executed at any five of the nodes, including the primary node. Via the enhanced arbiter role, the distributed database system is configured to handle the loss (or unreachability) of any four nodes (e.g., given nine nodes in a replica set). The guarantee is valid even if arbiters or the primary node are among the four unreachable nodes.
As primary nodes are configured to handle write requests, various embodiments are configured to ensure availability of a primary node even during network partitions or failure events. For example, if the primary server or node is unreachable for any period of time, an election process is triggered to establish a new primary server or node. FIG. 3 is a block diagram of an embodiment of a distributed database 300. The distributed database can include a replication engine 304 configured to process client data requests 302 (e.g., reads, writes (e.g., any modification to the database data), and any access to the data) and return data from the database in response to the client requests 302, as appropriate. The replication engine 304 and/or distributed database system 300 can be configured to accept write operations and direct any writes or modification to the database data to a primary node within the distributed database. According to one embodiment, the replication engine 304 and/or system 300 can be configured to assign roles to a plurality of nodes within a distributed database. The plurality of nodes is configured to host the database data and/or manage operations on the database data.
According to one embodiment, each node in the database that hosts a portion of the database data is initialized with at least a secondary node role. Secondary nodes are configured to host copies of the database data and can respond to read requests for database data. Primary nodes are configured to handle write operations against the database data (including, for example, any new data, new records, any modification, deletion, etc.). In some embodiments, primary nodes are elected from a group of secondary nodes. The replication engine 304 can be configured to execute election protocols when the database is first brought online or in response to a failure of a primary node. The replication engine can execute a variety of election protocols. In one embodiment, the replication engine is configured to follow a consensus-based election of a primary node, wherein a majority of available nodes must agree as to the best secondary node to take on primary responsibility. In further embodiments, arbiter nodes are configured to vote during the election protocol and the replication engine 304 can assign any number of nodes an arbiter role.
Typical election criteria can include electing only nodes with the freshest or newest data. In some implementations, each secondary node in the distributed database will attempt to elect itself and broadcast a vote to the other nodes. Each node checks received votes and, if its data is fresher, votes for itself. Such protocols can be enhanced by the presence of the operation log stored on arbiters. For example, a secondary node can be configured to vote for itself if an arbiter log contains fresher data than a received vote, even if the secondary node's own data is older than the received vote. Thus, a secondary node that would not qualify for election can be made electable by availability of the arbiter's operation log.
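A hedged sketch of that voting rule: the secondary's effective freshness is the newer of its own last applied operation and the tail of the arbiter's log, since anything in the log can be replayed to it. The timestamp comparison below is an illustrative simplification:

```python
def should_vote_for_self(own_last_ts, arbiter_log_last_ts, rival_last_ts):
    """A secondary is electable up to the newest operation it could
    replay from the arbiter's log, not just what it has applied."""
    effective_ts = max(own_last_ts, arbiter_log_last_ts)
    return effective_ts >= rival_last_ts

# The secondary's own data is stale (ts 90 < rival's ts 100), but the
# arbiter log reaches ts 105, so it remains a viable candidate.
print(should_vote_for_self(own_last_ts=90, arbiter_log_last_ts=105,
                           rival_last_ts=100))  # True
```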
According to further embodiments, the replication engine 304 can also be configured to facilitate data commitment 306 by counting receipt of a replicated operation at an arbiter towards a majority requirement for committing write operations. In some embodiments, the presence of a replicated operation at an arbiter is counted towards any data commitment requirement.
According to some embodiments, the system 300 and/or engine 304 can implement additional system components to manage certain replication, election, and/or configuration functions within the distributed database. For example, the system 300 and/or engine 304 can instantiate a configuration component 308 configured to manage configuration settings within the distributed database. In one example, the configuration component 308 generates a user interface that enables specification of roles (e.g., primary, secondary, and arbiter) for nodes of a distributed database. The configuration component 308 can be further configured to enable customization of data commitment requirements. In one implementation, the configuration component defines a default commitment rule that an operation must reach a majority of nodes before the operation is committed. The configuration component 308 can be configured to allow customization of the default process.
In another embodiment, the system 300 and/or engine 304 can include a replication component 310 configured to distribute write operations executed at primary nodes to secondary nodes for execution and to arbiter nodes for storage. The replication component can be further configured to request acknowledgements from secondary nodes and/or arbiter nodes in response to client requests. For example, a database client may request a safe write operation. The database system can be configured to determine that a threshold number of nodes (e.g., greater than half) have received the operation before acknowledging the safe write (e.g., where the write can be for data modification, data deletion, or data creation).
In one example, the replication component can be configured to execute an acknowledgement operation (e.g., getlasterror()) to confirm receipt of replicated operations, for example, at secondary nodes. The acknowledgement function can retrieve information on the most recent operation executed on a node in the database, including, for example, the most recent operation log entry made on an arbiter node. FIG. 4, described in greater detail below, illustrates one example process 400 for replicating write operations with enhanced arbiter participation. Process 400 can be executed by the replication component 310, the replication engine 304, and/or the system 300.
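In modern MongoDB drivers, the acknowledgement check that getlasterror() historically performed is typically expressed as a write concern on the operation itself. A minimal pymongo sketch, assuming a locally reachable replica set named rs0 (connection string and database/collection names are illustrative):

```python
from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")

# The driver does not acknowledge this insert until a majority of the
# replica set (which, per this disclosure, may include enhanced
# arbiters) reports the operation, or until wtimeout ms elapse.
events = client.mydb.events.with_options(
    write_concern=WriteConcern(w="majority", wtimeout=5000))

result = events.insert_one({"type": "order", "qty": 3})
print(result.acknowledged)  # True once the majority threshold is met
```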
In further embodiments, the system 300 and/or engine 304 can include an election component 312. The election component 312 can be configured to define the execution criteria for a primary node election. For example, the election component 312 can restrict election of a primary node to secondary nodes, and to secondary nodes having the freshest data. In another example, the election component 312 can factor geographic position into election protocols. In some embodiments, the election component 312 is configured to oversee the election of the best new primary node based on a variety of factors. Each of the factors can be given a weight such that each secondary node votes for itself if it has the best weighted election value. For nodes with equivalent data, geographic position can be evaluated to give one a better election value than another. In one example, the geographic position value is weighted to select nodes more proximate to the failed primary. In another example, location can be expressed in terms of a position within a rack in a datacenter in lieu of or in addition to geographic position. In one implementation, a secondary within the same rack as a failed primary can be favored over secondary nodes in nearby racks and/or secondary nodes in different geographic locations.
In one example system, the election protocol establishes a consensus by evaluating votes received from participating secondary systems to generate a quorum or consensus of reporting systems. In one example, a particular node can be voted for as the next primary system based on a query against the other nodes in the database to determine which node has the freshest data. Once the vote identifying a particular secondary system as the most up-to-date (or, in another example, the server with the best location) reaches a threshold number of quorum participants, that secondary system is confirmed as the new primary. As a result, the elected system's role/state is changed from secondary to primary and the remaining secondary systems set the new primary as the source for database update operations. According to some embodiments, the election does not require complete participation of the secondary nodes, and typically, only a majority of the secondary systems need to respond. The system propagates the change in the primary server to the remaining secondary nodes and the remaining secondary nodes update their configurations accordingly. The secondary servers then perform the operations necessary to bring themselves in sync with the new primary database. FIG. 5, described in greater detail below, illustrates one example process 500 for electing a new primary in response to detecting a failure in a primary node.
According to one embodiment, the replication engine 304 and/or components 308-312 can be distributed across a number of nodes making up the distributed database. In one example, the replication engine can execute across a plurality of shard servers in a sharded cluster, and can execute among members of a replica set. As discussed, a replica set can be configured to perform asynchronous replication across a series of nodes, with various processes implemented to handle recovery of primary node operations within the replica set, including consensus-based election of a primary node. Replica set configuration ensures high availability of the data. By modifying replica sets to include arbiter operation logs as disclosed herein, high availability of data reads can be accompanied by increased durability and/or increased efficiency in durability for data writes.
In one example, a replica set can be a set of n servers, frequently three or more, each of which contains a replica of the entire data set for a given shard. One of the n servers in a replica set will always be a primary node. If the primary node replica fails, the remaining replicas are configured to automatically elect a new primary node. In various embodiments, replica sets can support a number of database configurations, including sharded database configurations. For example, any logical partition in a distributed database can be maintained by a replica set. According to one embodiment, the enhanced arbiter role and the node configured to be the enhanced arbiter enable a minimal replica set configuration (e.g., one primary, one secondary, and one arbiter) while preserving the execution of standard replica set operations (e.g., consensus election of a primary, majority-based commitment, etc.). Without the arbiter, a replica set would require two secondary nodes in addition to the primary in order to implement majority elections and majority-based commitment operations.
Process 400, FIG. 4, illustrates an example of a process for replicating write operations with arbiter participation in a distributed database system. Given a set of nodes on which a database is implemented, process 400 begins with receiving a write operation (e.g., data modification, data creation, or data deletion) at 402. Prior to execution of process 400, nodes in the distributed database can be assigned primary and secondary roles, or primary nodes can be elected from groups of secondary nodes (including, for example, via process 500). Assignment of a primary node can occur as part of an initialization at start-up. In one alternative, assignment can occur based on the set of nodes that make up a replica set electing the primary at startup. Initialization can also include full replication of a database from one node to another node in the set. For example, a node may be added or initialized into a replica set using a synchronization operation that causes the node to capture a complete copy of database data as it exists on another node. Once synchronization is complete, replication operations can proceed for that node.
In some implementations, a single primary node provides a writable copy of a portion of database data, where write operations performed on the primary node are replicated asynchronously to all of the primary's secondary nodes that are responsible for at least the same data. The primary node replicates operations, for example, writes, by generating an operation log that reflects the operations performed on the primary database, for example, at 404. The operations can be transmitted asynchronously from the primary node to its respective secondary nodes, at 406. In some settings, the secondary nodes are configured to periodically query the operation log of the primary node to determine any operations that should be retrieved and executed. According to one embodiment, the operation log is configured to be part of the database itself. In another embodiment, the operation log is configured to not exceed a maximum size.
As operations occur they can be logged until the maximum log size is obtained, at which time the oldest operations are discarded or archived in favor of the newer operations. The operation log thus reflects a window of time for operations that can be replicated based on the permitted size of the operation log. The larger the size of the operation log, the greater the tolerance for downtime of nodes in the replica set. In one example, an operation log can be configured to a maximum size of 5-10% of the node's hard drive space. Other sizing for the operation log can be employed. For example, arbiter operation logs can be larger than operation logs maintained on nodes that also host data for the database. In some examples, database settings specify operation log size based on role (e.g., secondary roles have one setting, arbiter nodes have another, and the primary role can mirror the secondary or have yet another setting).
Each operation in the log can be associated with a time and/or an increasing value so that an order can be determined for each operation. In one example, a monotonically increasing value is employed and is associated with each operation. In one alternative, each operation can be time stamped. In one embodiment, the time stamp reflects the time of the primary node. Based on analysis of a first and last operation, a maximum operation log time can be determined. The maximum operation log time can be used in conjunction with replication operations to identify systems too far out of synchronization to replay operations from the log and thus require refreshing of the entire database. According to some embodiments, where enhanced arbiters hold a larger operation log than other nodes, this larger log can extend the window for which a node can be recovered by replaying operations.
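The window arithmetic described above is simple; the following is a hedged Python sketch, assuming each log entry carries a 'ts' timestamp field (all names here are hypothetical):

```python
from datetime import timedelta

def oplog_window(entries):
    """Replication window covered by a capped operation log;
    'entries' is ordered oldest to newest."""
    if not entries:
        return timedelta(0)
    return entries[-1]["ts"] - entries[0]["ts"]

def needs_full_resync(last_applied_ts, entries):
    # A node can replay from this log only if its last applied
    # operation is no older than the log's oldest retained entry;
    # otherwise its downtime predates the window.
    return not entries or last_applied_ts < entries[0]["ts"]
```

Because an enhanced arbiter's log can be sized larger than those of data-bearing nodes, running the same check against the arbiter's log yields a wider recoverable window.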
In some embodiments, the operation log and all database metadata can be stored in the same types of data structures used to store application data. Maintaining the operation log and database metadata in this manner enables the system to reuse the storage and access mechanisms implemented for the database data. In some embodiments, primary and secondary nodes can also be configured with a local database which is not replicated. The local database can be configured to maintain information on local state. For example, a secondary node can maintain information on its lag time (any delay in synchronization with the primary), time of last applied operation, and address of the primary node, as examples. Specific node configurations can also be stored in the node's local database.
In one embodiment, a secondary node executes a query against a primary node to determine all operations on the primary with a time stamp equal to or greater than the last applied operation time stamp in its local database. In another embodiment, the secondary node can query the primary node to determine all operations on the primary with an operation value (the increasing value) greater than or equal to the operation value last executed on the secondary. Arbiter nodes can also be configured to query the primary node, and, for example, process 400 can be executed with enhanced arbiters at 408 YES. In one example, arbiter nodes operate similarly to secondary nodes with respect to the data-related operations.
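A hedged sketch of that pull-based step follows, where fetch_ops stands in for the query against the primary's operation log and handle_op for the role-dependent handling (secondaries execute the operation, arbiters only record it to their log); every name here is hypothetical.

```python
def sync_step(local_state, primary, handle_op):
    """handle_op: role-dependent callback; a secondary applies the
    operation to its data, an arbiter only appends it to its log."""
    last = local_state["last_applied_optime"]
    # Fetch every primary-side operation at or after the last one
    # handled locally (ts >= last, as described above).
    for op in primary.fetch_ops(min_optime=last):
        if op["optime"] == last:
            continue  # boundary entry was already handled
        handle_op(op)
        local_state["last_applied_optime"] = op["optime"]
```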
If the distributed database does not require enhanced arbiters, 408 NO, the primary node is still configured to respond to a client write operation by processing the write operation (e.g., 404) and replicating the operation to any secondary nodes. Without the arbiter node, however, a database architect is unable to take advantage of the reduced resource requirements of implementing an arbiter node as opposed to a secondary node.
According to one embodiment, a client requests database access through application protocol interfaces (APIs). An API can be configured to execute a driver that can identify a primary node in a replica set. In one example, a driver program is configured to connect to the entire replica set and identify any primary. The API, and/or an associated driver, can be configured to retain information on any identified primary node. In the event of primary node failure, an error can be returned when a request asks a non-primary node to perform primary-only operations, e.g., a write. In response to such an error, the API and/or any associated driver can be configured to re-identify a new primary node.
As discussed, the primary node generates an operation log for each database write operation, and, for example, the operation is replicated asynchronously to the secondary nodes at 406 and the secondary nodes execute the operation from the primary node's operation log at 410. According to one embodiment, the secondary nodes also record operation records to a secondary local operation log to track applied operations. During generation of the operation log on the primary node, each operation can be assigned a monotonically increasing value or other unique identifier to facilitate replication.
In further embodiments, each replicated operation can be associated with information specific to the primary node. For example, an identifier for the primary node can be assigned, and/or a time stamp can be assigned based on the primary node time and included in any communicated replication operation. The information specific to the primary node (e.g., time, primary identifier, monotonically increasing value, etc.) can be used to verify operation ordering and to identify any missing operations.
In one embodiment, the information specific to the primary node (e.g., timestamp, monotonically increasing operation value, primary node identifier, etc.) can be used by the system to determine a maximum operation time (i.e., time associated with the most recent operation) and the maximum operation time can be stored on a node for reporting or retrieval. In one example, the maximum operation time can be defined based on a monotonically increasing value and thus can be used to identify how up-to-date a secondary node's database is. In another example, timestamps for each operation can be used to determine how up-to-date the secondary node's database is. Various functions can request a maximum operation time from a node to determine the respective state of the node's database.
Returning to process 400, as each secondary node executes the operation, an operation count can be updated at 412. For example, each secondary can acknowledge the operation to the primary node. In one alternative, the primary node can query the secondary nodes to determine execution and increase an operation count. If the count exceeds a threshold number of nodes (e.g., greater than half of the responsible nodes), 414 YES, the primary acknowledges the safe write operation at 416.
Once an operation has been replicated at a threshold number of nodes, the operation can be guaranteed to be retained. For example, where the threshold number of nodes represents a majority of the nodes in a replica set, even in light of a failed primary, an operation that has reached the majority of nodes will be retained. Although automatic failover processing can result in lost data, an operation becomes durable once replicated across a majority of the nodes within any replica set. In one example, during a failover scenario an operation having reached a majority of nodes will be present on any node subsequently elected primary, preserving the operation. According to one embodiment, transactions that have not replicated to a majority of nodes in the replica set can be discarded during a failover scenario. For example, election of a new primary identifies a secondary node with the freshest data, and reintegration of the failed primary can result in loss of any data not present on the new primary.
Returning to 408 YES, process 400 can be executed in conjunction with enhanced arbiter nodes. The operation executed at 404 is replicated to secondary nodes at 406 and to enhanced arbiter nodes at 418. Once the arbiter has stored the operation into its operation log, an operation count can be increased at 420. In one example, each arbiter can acknowledge the operation to the primary node. In one alternative, the primary node can query the arbiter nodes to determine execution and increase an operation count (e.g., at 420). If the count exceeds a threshold number of nodes (e.g., greater than half of the responsible nodes), 414 YES, the primary acknowledges the safe write operation at 416. If the threshold test at 414 fails, 414 NO, the process continues until a sufficient number of secondary nodes or arbiter nodes have stored or executed the operation (e.g., via 410 and 412 or 418 and 420). In some embodiments, the determination of whether a threshold has been reached is made at the primary hosting the data being written.
In one embodiment, the threshold being tested at 414 is based on greater than half of the responsible nodes receiving the operation. If enhanced arbiters are being used, they are included in the count of responsible nodes. In other words, the number of responsible nodes is equal to the number of nodes hosting the data being affected plus any arbiters maintaining a replication log for the data being affected. Once a majority of nodes is reached, the operation is deemed committed.
In some settings, a primary node can be configured to block write operations when secondary nodes are too far behind the primary node in performing their replicated write operations. In one example, a maximum lag value can be configured for a replica set that triggers a primary node to block write operations when the maximum lag value is exceeded. In one embodiment, the maximum lag time can be expressed as a maximum lag time for a threshold number of nodes. If the number of nodes with a lag time in excess of the maximum exceeds the threshold, the primary node blocks write operations. In one implementation, nodes may report their lag time to the replication component periodically.
In another implementation, queries can be executed against nodes in the replica set to determine lag time. In some settings, secondary nodes can request that a primary node block write operations in response to lag time. Lag time can also be calculated and/or reported on by, for example, arbiter nodes based on queried maximum operation time. Additionally, arbiter nodes can report on status messages from secondary nodes that reflect maximum operation time for the given node. In some embodiments, secondary nodes are configured to provide reporting on status, and in some examples, can be configured to track status information on other nodes in a replica set. In one example, based on an operation log at the arbiter node, the system can automatically synchronize secondary nodes that exceed the maximum lag value but are still within the operation log window of the arbiter.
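A minimal sketch of that flow-control rule follows, with assumed configuration values (the constants and names are illustrative, not defaults of any real system):

```python
MAX_LAG_SECONDS = 30    # assumed maximum tolerated replication lag
MAX_LAGGING_NODES = 1   # assumed threshold count of lagging nodes

def should_block_writes(reported_lags):
    """reported_lags: replication lag in seconds, one value per
    reporting node (secondary or arbiter)."""
    lagging = sum(1 for lag in reported_lags if lag > MAX_LAG_SECONDS)
    return lagging > MAX_LAGGING_NODES
```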
In some embodiments, nodes can be prevented from taking on the role of a primary to prevent data loss. In particular, transient failures of communication and even failure of an entire datacenter's power can occur in routine operation. In one embodiment, by configuring each node with a local uptime counter, a node can also include uptime information when under consideration for election to primary status. Requiring eligibility checks, for example based on uptime, can prevent data loss in the event of transient failures and even where a datacenter loses power. As the nodes in a replica set are restored, depending on the order in which the nodes return to operation, a secondary node could trigger a failover process. Failover procedures can result in the loss of data that has not replicated to a majority of nodes. Limiting a primary election process to eligible nodes can minimize resulting data loss.
According to one example, during replication and, for example, execution of process 400, a secondary node can identify based on the operation received whether there are any missing operations (e.g., prior operations which would lead to inconsistency if subsequent operations were executed out of order). In the event of missing operations, a secondary node can be configured to halt replication and enter an offline state. Once in the offline state, a node may require intervention to restore function. In further examples, a node can be automatically returned from halted replication by refreshing the entire database for the node.
According to some embodiments, if a node goes offline and comes back, the node is configured to review any accessible operation log (e.g., local operation log, current primary operation log, arbiter operation log, etc.). If that node cannot find an operation log that has all the operations from the intervening time, the node is configured to require a full resync in order to return to an active state. As discussed above, with enhanced arbiters, the enhanced arbiters can be configured with an operation log spanning a longer time period than the primary node itself; thus the window of downtime that does not require a full resync is extended.
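A hedged sketch of that recovery decision, assuming each reachable log advertises the oldest operation it still retains (names are hypothetical):

```python
def recovery_plan(last_applied, reachable_logs):
    """reachable_logs: mapping of node name -> oldest optime still
    retained in that node's operation log."""
    for name, oldest in reachable_logs.items():
        if oldest <= last_applied:
            # This log still covers the node's downtime; replay from
            # it. Enhanced arbiter logs often reach back the furthest.
            return ("replay", name)
    return ("full_resync", None)
```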
In one alternative example, a secondary node can be configured to synchronize its data by querying an enhanced arbiter node to retrieve operations from the arbiter operation log. If capable of being updated by the arbiter, the secondary node can return to an online status.
In some embodiments, the transaction log of the operations performed on the primary node can reflect optimizations and/or transformations of the operations performed at the primary node. For example, increment operations performed on the master database can be transformed into set operations. In some examples, operations performed on the primary can be merged prior to generating an entry on the transaction log, reducing the overall number of operations replicated to the secondary nodes.
Additional embodiments may execute different processes for replicating write operations with arbiter participation. According to one embodiment, an example process can include receipt of a write request at a primary node of a replica set. The primary node executes the write request and logs the operation. Secondary nodes can query the primary node periodically to obtain information on the primary's operation log. In one example, secondary nodes are configured to retrieve new operation log entries via a "tailable cursor." Tailable cursors are conceptually equivalent to the tail UNIX command with the -f option (i.e., with "follow" mode). After the primary executes any write operation (e.g., a client inserts new or additional documents into a collection), the tailable cursor executed by the secondary node identifies the subsequent entry in the primary's operation log. The secondary node captures the new operation and executes it against the secondary node's copy of the database data.
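A minimal pymongo sketch of tailing a capped collection in this manner follows; the connection string is an assumption, and handle_op is a hypothetical callback (apply the operation on a secondary, log it on an arbiter).

```python
import time
from pymongo import MongoClient
from pymongo.cursor import CursorType

client = MongoClient("mongodb://primary-host:27017/")
oplog = client.local["oplog.rs"]  # the replica set's operation log

def tail(handle_op):
    cursor = oplog.find(cursor_type=CursorType.TAILABLE_AWAIT)
    while cursor.alive:
        for op in cursor:
            handle_op(op)   # new entries stream in as the primary logs them
        time.sleep(1)       # cursor went idle; back off briefly and resume
```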
According to another embodiment, primaries do not proactively communicate operations to the secondary nodes. Rather, the secondary nodes query the primary node's operation log. In further embodiments, to validate that a threshold for replication has been reached, another asynchronous process can be used. The asynchronous commitment protocol can be configured to check a status board for write operations to determine if the write operations have been completed by a threshold number of nodes. During execution of the commitment protocol, the status board can be updated (e.g., asynchronously) by the secondary nodes and/or any enhanced arbiters when they receive and/or execute an operation.
To provide another example algorithm for replication, the operations performed by a five-member replica set are described (a code sketch of the status-board check follows this list):
    • 1) Primary A: receives a write request. A thread T executed on the primary handles this operation, writes the operation to the log, and goes to sleep with a timer.
    • 2) Secondary Nodes and Arbiter Nodes: constantly stream the operation log from the primary (e.g., via tailable cursor). As the secondary nodes execute the logged operations (e.g., writes) in batches, the secondary nodes update the primary node on the most recent executed operation. Arbiter nodes copy the logged operations to their operation log and notify the primary of the most recent logged operation. In one example, the secondary nodes and arbiter nodes write to a status board on the primary node to notify the primary of replication status.
    • 3) Thread T wakes up and checks its status board. Secondary Node B has written operations from the log through operation 650, Secondary Node C has written operations up to 670, Secondary Node D has written operations to 650, and Arbiter Node E has logged operations up to operation 650. Given the states of the nodes, only A and C have operation 654. No majority has been reached for any operations after 650. Thread T goes to sleep.
    • 4) Secondary Node B receives a batch of operations, up to operation 670. Secondary Node B applies the operations and writes to the primary status board "done through 670."
    • 5) Thread T wakes up and checks the status board again. Thread T identifies that Secondary Nodes B and C have applied operations more recent than operation 654, so, including the primary node A, a majority of nodes have executed and/or logged the operation.
    • 6) Responsive to determining that the majority has been reached, thread T acknowledges to the client that the write is complete.
      • Alternative to 4)-6): at 4) Arbiter Node E receives a batch of operations up to operation 670. Arbiter Node E records the batch of operations into its operation log and writes "done through 670" to the status board on the primary. 5) and 6) proceed as above, but on the basis that nodes C and E, as well as the primary node, have executed and/or logged the operations up to operation 670.
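The following is a minimal sketch of the status-board check thread T performs in steps 3)-5), using the operation counts from the walkthrough; the data structures are hypothetical.

```python
def majority_reached(pending_op, status_board, members=5):
    """status_board: node -> highest operation applied or logged there."""
    acks = sum(1 for optime in status_board.values() if optime >= pending_op)
    return acks >= members // 2 + 1

board = {"A": 670, "B": 650, "C": 670, "D": 650, "E": 650}  # step 3
print(majority_reached(654, board))   # False: only A and C hold op 654
board["B"] = 670                      # step 4: B reports "done through 670"
print(majority_reached(654, board))   # True: A, B, and C form a majority
```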
In some implementations, in addition to hosting read-only replicas of the primary database, the secondary nodes are configured to assist in the operation of the distributed database or of a replica set. In particular, the secondary nodes participate in protocols to elect a new primary node in the event of failures. Such protocols can be based on establishing a new primary node based on a quorum of participating nodes. Such a quorum protocol can be configured to require majority participation, for example, or can be configured to require a threshold number of participants prior to completing any quorum protocol.
In some embodiments, each secondary node can be configured to participate in an election protocol that establishes, by a quorum comprising a threshold number of nodes, that a particular node should be the new primary node. For example, the secondary node can be configured to join and/or announce membership in a group of secondary nodes that have also identified a particular node as the next primary node. Once the number of members in the group/quorum reaches a threshold number, the elected node can be assigned the primary role.
In one example, an arbiter system can collect status information on quorum participants. The arbiter system can be further configured to communicate the quorum result and/or trigger the status change to primary on another node. In some embodiments, the quorum protocol is configured to require that a majority of the nodes responsible for the written data participate in the quorum prior to sending an acknowledgement. According to one embodiment, once a quorum is obtained, the secondary node identified by the quorum becomes the new primary. Client drivers will re-direct write requests to the new primary node.
One example of an election process includes querying all other nodes for their most recently applied operation. In one example, a maxappliedoptime function can be executed to capture information on a most recently applied operation. The information can be captured from timestamps, operation values indicative of order, among other options.
In one example, the election protocol is defined to elect a node with the most recent data. For a node that determines it has the freshest data set (e.g., the most recently applied operation is the most recent operation available to any currently reachable member of the replica set), that node will vote for itself. In some embodiments, the self-vote operation can be restricted to nodes that can communicate with a majority of nodes in the replica set.
Upon receipt of a vote message, a given node will determine whether its own data is fresher; if not, it confirms the received vote, and if so, it responds to the vote message with a negative vote. The example process can be augmented by including timeouts for sending vote messages. For example, after confirming a vote or electing itself, a node can be configured to respond negatively to all other vote messages for a period of time. In addition, the above process can be repeated until a node is elected. In some examples, tie resolution can include a random wait period and a new check for freshest data/maxappliedoptime.
FIG. 5 is an example process flow 500 for consensus election with arbiter participation. Various steps in the example algorithm can be executed by individual nodes while participating in a consensus election protocol. Other steps of the process describe states associated with the replica set as individual nodes perform the described operations. Process 500 begins at 502 with the detection of a failure event. Failure events can be based on communication failures. For example, each node in the distributed database or in a replica set can be configured to provide periodic messages, known as heartbeats, to all other known members of a replica set, indicating it is still up and reachable. The absence of the heartbeat message permits identification of communication failures. Other examples include secondary nodes that receive error messages when attempting to query their primary nodes. Further, power failures and/or hardware failures on nodes can result in a failure event that triggers an election protocol at 504. The first node to participate in the election process will not have received any vote messages from any other nodes, 506 NO, and will seek to elect itself at 508.
For other nodes participating in the election, the node may, 506 YES, or may not, 506 NO, have received a message from other nodes requesting that the node confirm a received vote. If a vote is received, 506 YES, a node compares the election information of the received vote against its own values at 510. If the node has greater election values, for example, a higher priority, fresher data, better location, greater hardware capacity, etc., the node attempts to elect itself at 508. In some embodiments, each node will evaluate itself on the election protocol requirements. One example protocol includes a freshest data requirement. Another example of an election protocol requirement includes freshest data and a location requirement (e.g., closest to the failed primary). Another example of an election protocol requirement includes freshest data with greatest uptime.
According to some embodiments, the evaluation of better election information at 512 can include evaluation of data available via arbiter operation logs. For example, a secondary node can be configured to evaluate its data against other voting peers based on whether the secondary node can achieve a freshest data value equal to or better than another node by capturing operations from an arbiter operation log. If the secondary node can achieve the same or better freshest data value and includes other better election criteria, 512 YES, the secondary node can elect itself at 508 and update its data accordingly.
The systems that attempt to elect themselves and the nodes that confirm a vote become part of a group of systems representing the group of nodes and an identification of a node that can take on the primary node role. Any node can enter a group either by electing itself at 508 or by confirming a vote for another node at 514. Once a majority of nodes are in agreement on a new primary, a quorum has been established at 516. If agreement has not been reached, 516 NO, further evaluation of votes or attempts to self-elect continue via branches 506 YES or 506 NO.
If, for example at 512 NO, it is determined that a node receiving a vote does not have election information greater than the received vote (either with or without arbiter participation), then the receiving node confirms the vote for the node with the best election information at 514. If the receiving node has better election information, 512 YES, the receiving node can vote for itself at 508. Once the group of secondary nodes reaches a threshold value for the number of participating systems (i.e., a quorum is established at 516), the node identified for primary by the majority of participating nodes is assigned the primary node role at 518. In one embodiment, the threshold is set to require a majority of the nodes in the distributed database or replica set to agree on the next primary node. Other embodiments can use different threshold values.
The end result of the execution of individual election operations, for example, in process 500 (e.g., 502-514), is the establishment of an agreement between at least a threshold number (e.g., a majority) of the responsible nodes on which node should be primary (e.g., 516 YES). The reaching of agreement by the threshold establishes a quorum of responsible nodes on which node should be elected primary. Last, once the quorum has been reached, clients are directed to the primary node in response to access requests. In some embodiments, administration operations can be executed so that routing processes and/or configuration processes identify the newly elected primary as the primary node for an active replica set (e.g., assigning the primary role at 518).
Further, the calculation of election values can include execution of an election information generation sub-process. An example process 600 for determining election information is illustrated in FIG. 6. Process 600 begins with a node determining its priority from its local database at 602. In addition to the priority value, a value associated with the node's last executed operation can be retrieved from the node's local database at 604. In the event of equal priority values, the node with the freshest data will be elected (i.e., the node with the better operation value).
In one example, the node with the smallest lag time from the former primary node will generate the highest election value. Other embodiments can resolve additional parameters in determining an election value. For example, 606 YES, additional parameters can be included in the determination of a node's election information at 610. In one embodiment, the location of the node can be given a value depending on a preferred location and captured at 608. For example, the greater the distance from a preferred location, the smaller the location value assigned to that node.
In another embodiment, nodes within the same rack as the former primary node can be favored over other nodes in the replica set. In yet another embodiment, location values can depend on geographic position, and a node with a different location than the current primary node can be favored. In another embodiment, a hardware resource capability (e.g., disk size, RAM, etc.) of a node can be assigned a value in determining an overall election value. Communication history can also be factored into election information for a particular node. For example, historic communication stability can improve a determined election value, and conversely a history of communication failure can lower an election value. In one example, each of the parameters can have a value and the node with the greatest total value can be elected. In other examples, each of the parameters is weighted (e.g., freshest data can have a large value relative to location, hardware, etc.) to favor selected ones of the parameters. Giving the freshest data value the highest weighting (e.g., total election value = 0.6*"freshest data value" + 0.2*"location value" + 0.1*"hardware score" + 0.1*"uptime score") reflects an emphasis on freshest data, for example.
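A hedged sketch of that weighted scoring, using the example weights above and assuming per-criterion scores normalized to [0, 1] (the scores themselves are invented for illustration):

```python
WEIGHTS = {"freshness": 0.6, "location": 0.2, "hardware": 0.1, "uptime": 0.1}

def election_value(scores):
    """scores: per-criterion values in [0, 1]; missing criteria score 0."""
    return sum(weight * scores.get(name, 0.0)
               for name, weight in WEIGHTS.items())

near = election_value({"freshness": 0.7, "location": 1.0,
                       "hardware": 0.8, "uptime": 0.9})   # 0.79
fresh = election_value({"freshness": 1.0, "location": 0.5,
                        "hardware": 0.8, "uptime": 0.9})  # 0.87
print(fresh > near)  # True: freshest data outweighs a better location
```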
If election information has been received from another node and, for example, the present node has a lower priority value and/or older data that cannot be made better or equal through an arbiter operation log, 606 NO, evaluation of the election information at 612 triggers the present node to confirm the node with better election information at 614 via 612 NO. If no election information has been received, a node can skip evaluation at 612, or determine that based on current information it has the best information, 612 YES, and aggregate election information for an attempt to self-elect at 616. In one example, the election value can include priority, last operation time, location, and hardware configuration. Other embodiments can use different values, different combinations, or subsets of the identified parameters and generate election information/election values including those parameters at 610.
According to one embodiment, once a new primary system is elected, the replica set continues to respond to read and write operations normally. For clients with connections established to the former primary node, errors will be returned as the client attempts to perform operations against the former primary. The errors can be returned based on an inability to communicate if, for example, a communication failure caused a new primary to be elected. Errors will also be returned if the former primary itself failed. Additionally, errors will also be returned if the former primary has been re-established as a secondary node. In response to a write operation, a former primary responds with an error message indicating that it is not primary. In one embodiment, the former primary can also be configured to respond with the address of its current primary. In one alternative, a client can discover a new primary in response to the error message. A new primary may need to be discovered any time the primary node changes from one node to another in a replica set. Discovery can occur by connecting to the entire replica set, as one example. In one alternative, the node returning a not primary error message can be configured to identify the node it believes is primary and if the node returning the error message does not have the address of the primary yet, that state can be indicated in a returned error message. The return of additional information with the not primary error message can be limited to systems that had the primary node responsibility within a configurable amount of time from receiving the request.
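A hedged pymongo sketch of the driver-side recovery described above: retry a write while the driver rediscovers the replica set's new primary (connection string, names, and retry budget are assumptions):

```python
import time
from pymongo import MongoClient
from pymongo.errors import AutoReconnect

client = MongoClient("mongodb://h1:27017,h2:27017/?replicaSet=rs0")

def insert_with_failover(doc, retries=5):
    for attempt in range(retries):
        try:
            return client.mydb.events.insert_one(doc)
        except AutoReconnect:
            # "Not primary" style failure: an election may be under
            # way; back off and let the driver rediscover the primary.
            time.sleep(0.1 * 2 ** attempt)
    raise RuntimeError("no primary elected within the retry budget")
```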
The various functions, processes, and/or pseudo code described herein can be configured to be executed on the systems shown by way of example in FIGS. 1-3. The systems and/or system components shown can be specially configured to execute the processes and/or functions described. Various aspects and functions described herein, in accord with aspects of the present invention, may be implemented as specially configured hardware, software, or a combination of hardware and software on one or more specially configured computer systems. Additionally, aspects in accord with the present invention may be located on a single specially configured computer system or may be distributed among one or more specially configured computer systems connected to one or more communication networks.
For example, various aspects, components, and functions (e.g., shard, node, data router, application layer, replication component, election component, configuration component, etc.) may be distributed among one or more special purpose computer systems configured to provide a service to one or more client computers or mobile devices, or to perform an overall task as part of a distributed system. Additionally, aspects may be performed on a client-server or multi-tier system that includes components or engines distributed among one or more server systems that perform various functions. Consequently, examples are not limited to executing on any particular system or group of systems. Further, aspects and functions may be implemented in software, hardware or firmware, or any combination thereof. Thus, aspects and functions may be implemented within methods, acts, systems, system elements and components using a variety of hardware and software configurations, and examples are not limited to any particular distributed architecture, network, or communication protocol.
Referring to FIG. 8, there is illustrated a block diagram of a distributed special purpose computer system 800, in which various aspects and functions are practiced (e.g., including a replication component (e.g., captures executed write operations and distributes them to nodes hosting the same copy of data), a configuration component (e.g., enables arbiter participation in either or both data commitment and primary election), and an election component (e.g., triggers election protocols in response to primary failure), among other options). As shown, the distributed computer system 800 includes one or more special purpose computer systems that exchange information. More specifically, the distributed computer system 800 includes computer systems 802, 804, and 806. As shown, the computer systems 802, 804, and 806 are interconnected by, and may exchange data through, a communication network 808. For example, a segment of a distributed database can be implemented on 802, which can communicate with other systems (e.g., 804 and 806), which host other or remaining portions of the database data and/or copies of the database data.
In some embodiments, the network 808 may include any communication network through which computer systems may exchange data. To exchange data using the network 808, the computer systems 802, 804, and 806 and the network 808 may use various methods, protocols, and standards, including, among others, TCP/IP or other communication standards, and may include secure communication protocols such as VPN and IPsec. To ensure data transfer is secure, the computer systems 802, 804, and 806 may transmit data via the network 808 using a variety of security measures including, for example, TLS, SSL, VPN, or other standards. While the distributed computer system 800 illustrates three networked computer systems, the distributed computer system 800 is not so limited and may include any number of computer systems and computing devices, networked using any medium and communication protocol.
As illustrated in FIG. 8, the special purpose computer system 802 includes a processor 810, a memory 812, a bus 814, an interface 816, and data storage 818, and further includes any one or more of the components discussed above to implement at least some of the aspects, functions, and processes disclosed herein, as either a stand-alone system or part of a distributed system. In some embodiments, the processor 810 performs a series of instructions that result in manipulated data. The processor 810 may be any type of processor, multiprocessor, or controller. The processor 810 is connected to other system components, including one or more memory devices 812, by the bus 814.
The memory 812 stores programs and data during operation of the computer system 802. Thus, the memory 812 may be a relatively high performance, volatile, random access memory such as dynamic random access memory (DRAM) or static memory (SRAM), or another standard. However, the memory 812 may include any device for storing data, such as a disk drive, hard drive, or other non-volatile storage device. Various examples may organize the memory 812 into particularized and, in some cases, unique structures to perform the functions disclosed herein. These data structures may be sized and organized to store values particular to specific database architectures and specific data types, and in particular, may include standardized formats for organizing and managing data storage.
Components of the computer system 802 are coupled by an interconnection element such as the bus 814. The bus 814 may include one or more physical busses, for example, busses between components that are integrated within the same machine, but may include any communication coupling between system elements, including specialized or standard computing bus technologies such as IDE, SCSI, PCI, and InfiniBand, or other standards. The bus 814 enables communications, such as data and instructions, to be exchanged between system components of the computer system 802.
The computer system 802 also includes one or more interface devices 816 such as input devices, output devices, and combination input/output devices. Interface devices may receive input or provide output. More particularly, output devices may render information for external presentation. Input devices may accept information from external sources. Examples of interface devices include keyboards, mouse devices, microphones, touch screens, printing devices, display screens, speakers, network interface cards, etc. Interface devices allow the computer system 802 to exchange information and to communicate with external entities, such as users, vendors, and other systems.
The data storage 818 includes a computer readable and writeable nonvolatile, or non-transitory, data storage medium in which instructions are stored that define a program or other object that is executed by the processor 810. The data storage 818 also may include information that is recorded, on or in, the medium, and that is processed by the processor 810 during execution of the program. More specifically, the information may be stored in one or more data structures specifically configured to conserve storage space or increase data exchange performance.
The instructions stored in the data storage may be persistently stored as encoded signals, and the instructions may cause the processor 810 to perform any of the functions described herein. The medium may be, for example, an optical disk, magnetic disk, or flash memory, among other options. In operation, the processor 810 or some other controller causes data to be read from the nonvolatile recording medium into another memory, such as the memory 812, that allows for faster access to the information by the processor 810 than does the storage medium included in the data storage 818. The memory may be located in the data storage 818 or in the memory 812; however, the processor 810 manipulates the data within the memory, and then copies the data to the storage medium associated with the data storage 818 after processing is completed. A variety of components may manage data movement between the storage medium and other memory elements, and examples are not limited to particular data management components. Further, examples are not limited to a particular memory system or data storage system.
Although the computer system 802 is shown by way of example as one type of computer system upon which various aspects and functions may be practiced, aspects and functions are not limited to being implemented on the computer system 802 as shown in FIG. 8. Various aspects and functions may be practiced on one or more specially configured computers having different architectures or components than those shown in FIG. 8, which can be modified to include the special purpose components and/or functions discussed. For instance, the computer system 802 may include specially programmed, special-purpose hardware, such as an application-specific integrated circuit (ASIC) tailored to perform any one or more operations disclosed herein (e.g., validating received operations, routing write operations, replicating operations, among other examples). Another example may perform the same function(s) using a grid of several computing devices running MAC OS System X with Motorola PowerPC processors and several specialized computing devices running proprietary hardware and operating systems.
The computer system 802 may be a computer system including an operating system that manages at least a portion of the hardware elements included in the computer system 802. Additionally, various aspects and functions may be implemented in a non-programmed environment, for example, documents created in HTML, XML, or other formats that, when viewed in a window of a browser program, can render aspects of a graphical user interface or perform other functions.
According to one embodiment, a distributed database can include one or more data routers for managing distributed databases. The one or more data routers can receive client requests (e.g., user entered data requests, data requests received from an application programming interface (API), or other computing entity requests) and route the requests to appropriate servers, systems, or nodes within the distributed database. In some embodiments, one or more data routers can be configured to communicate replication operations to arbiter nodes based on configurations of the distributed database. In other embodiments, the data routers can deliver requests to local entities (e.g., a replica set) which can distribute operations (e.g., including write operations) to any member of the replica set, including any arbiters.
Further, various examples may be implemented as programmed or non-programmed elements, or any combination thereof. For example, a web page may be implemented using HTML while a data object called from within the web page may be written in C++. Thus, the examples are not limited to a specific programming language and any suitable programming language could be used. Accordingly, the functional components disclosed herein may include a wide variety of elements, e.g., specialized hardware, executable code, data structures or data objects, that are configured to perform the functions described herein.
Having thus described several aspects of at least one example, it is to be appreciated that various alterations, modifications, and improvements will readily occur to those skilled in the art. For instance, examples disclosed herein may also be used in other contexts. Such alterations, modifications, and improvements are intended to be part of this disclosure, and are intended to be within the scope of the examples discussed herein. Accordingly, the foregoing description and drawings are by way of example only.
Use of ordinal terms such as “first,” “second,” “third,” “a,” “b,” “c,” etc., in the claims to modify or otherwise identify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed, but are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term) to distinguish the claim elements.

Claims (21)

What is claimed is:
1. A computer implemented method for managing a distributed database, the method comprising:
establishing at least one primary node within a plurality of nodes, wherein the plurality of nodes comprise a distributed database system and the distributed database system provides responses to database requests from database clients;
establishing at least one secondary node configured to host a replica of data of the at least one primary node from the plurality of nodes and update the replica responsive to received replicated operations from the at least one primary node;
establishing at least one arbiter node configured to host an operation log of operations executed by the at least one primary node, wherein the at least one arbiter node does not host a replica of the data; and
electing a new primary node responsive to detecting a failure of the at least one primary node, wherein electing the new primary node includes:
executing a consensus protocol to elect one of the at least one secondary node as the new primary node, wherein the consensus protocol includes participation of the at least one secondary node and the at least one arbiter node associated with the failed at least one primary node;
evaluating election criteria at the at least one secondary node during an election period;
communicating by a respective one of the at least one secondary node i) a self-vote responsive to determining that the respective one of the at least one secondary node meets the election criteria, or ii) a confirmation of a received vote of another secondary node responsive to determining that the another secondary node is more suitable to be elected as the new primary node based on the election criteria; and
evaluating by the respective one of the at least one secondary node the operation log of the at least one arbiter node as part of the acts of determining.
2. The method according to claim 1, further comprising restricting processing of write operations, received from the database clients, to the at least one primary node.
3. The method according to claim 1, further comprising an act of communicating the self-vote responsive to querying the operation log of the at least one arbiter node and determining that the operation log of the at least one arbiter node is more up to date than operation logs of one or more of the at least one secondary node based on received election information.
4. The method according to claim 1, wherein the election criteria include at least one of, or any combination of, or two or more of: a most recent data requirement, a location requirement, a hardware requirement, and an uptime requirement.
5. The method according to claim 3, wherein responsive to the respective one of the at least one secondary node determining that the operation log of the at least one arbiter node includes the most up to date data, triggering an update of a replica of the data hosted by the respective one of the at least one secondary node.
6. The method according to claim 1, further comprising an act of committing write operations responsive to safe write requests received from the database clients, wherein committing the write operations includes counting an act of logging the write operations at the at least one arbiter node towards any commitment requirements.
7. The method according to claim 1, wherein the at least one arbiter node is configured to limit participation in replication operations at the at least one arbiter node to updating the operation log without replication of the operations on respective data.
8. The method according to claim 1, wherein the at least one arbiter node is configured to provide operations stored in the operation log to one or more of the at least one secondary node.
9. A distributed database system, the system comprising:
at least one processor operatively connected to a memory, wherein the at least one processor when running is configured to execute a plurality of system components, wherein the plurality of system components comprise:
a configuration component configured to establish a role associated with each node in a plurality of nodes, wherein the configuration component is configured to establish at least one primary node with a primary role, at least one secondary node with a secondary role, and at least one arbiter node with an arbiter role;
the at least one primary node configured to:
host data;
execute write operations on at least a portion of the data and generate at least one log entry for execution of the write operations;
replicate the at least one log entry to the at least one secondary node and the at least one arbiter node;
the at least one secondary node configured to:
host a replica of the data of the at least one primary node;
execute the at least one log entry received from the at least one primary node to update a respective copy of the at least a portion of the data;
the at least one arbiter node configured to update an operation log of operations responsive to receipt of the at least one log entry for the write operations from the at least one primary node, wherein the at least one arbiter node does not host a replica of the data;
an election component configured to elect a new primary node responsive to detecting a failure of the at least one primary node, wherein the election component is further configured to:
execute a consensus protocol to elect one of the at least one secondary node as the new primary node, wherein the consensus protocol includes participation of the at least one secondary node and the at least one arbiter node associated with the failed at least one primary node;
evaluate election criteria at the at least one secondary node during an election period;
communicate i) a self-vote for a respective one of the at least one secondary node responsive to determining that the respective one of the at least one secondary node meets the election criteria, or ii) a confirmation of a received vote of another secondary node responsive to determining that the another secondary node is more suitable to be elected as the new primary node based on the election criteria; and
evaluate the operation log of the at least one arbiter node as part of the acts of determining.
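To illustrate the configuration component of claim 9, here is a sketch that assigns the three claimed roles to the members of a replica set. The Role enum and configure_replica_set function are invented for this example and do not reflect an actual replica-set configuration document.

from enum import Enum

class Role(Enum):
    PRIMARY = "primary"      # hosts data, executes writes, generates log entries
    SECONDARY = "secondary"  # hosts a replica, applies replicated log entries
    ARBITER = "arbiter"      # keeps an operation log only; hosts no data

def configure_replica_set(members: dict) -> dict:
    roles = list(members.values())
    assert roles.count(Role.PRIMARY) == 1, "exactly one primary per replica set"
    assert Role.SECONDARY in roles, "at least one secondary"
    assert Role.ARBITER in roles, "at least one arbiter"
    return members

cfg = configure_replica_set({
    "node-a:27017": Role.PRIMARY,
    "node-b:27017": Role.SECONDARY,
    "node-c:27017": Role.ARBITER,
})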
10. The system of claim 9, further comprising a replication component configured to restrict the write operations received from client computer systems to the at least one primary node having the primary role.
11. The system according to claim 9, wherein the election component is distributed through one or more nodes comprising the distributed database system.
12. The system according to claim 11, wherein the election component is distributed at least through the at least one secondary node and the at least one arbiter node.
13. A computer implemented method for managing a distributed database, the method comprising:
establishing at least one primary node configured to host data;
establishing at least one secondary node configured to host a replica of the data hosted by the at least one primary node;
establishing at least one arbiter node, the at least one arbiter node configured to host an operation log of operations executed by the at least one primary node, wherein the at least one arbiter node does not host a replica of the data hosted by the at least one primary node or the at least one secondary node;
electing a new primary node responsive to detecting a failure of the at least one primary node, wherein electing the new primary node includes:
executing a consensus protocol to elect one of the at least one secondary node as the new primary node, wherein the consensus protocol includes participation of the at least one secondary node and the at least one arbiter node associated with the failed at least one primary node;
evaluating election criteria at the at least one secondary node during an election period;
communicating by a respective one of the at least one secondary node i) a self-vote responsive to determining that the respective one of the at least one secondary node meets the election criteria, or ii) a confirmation of a received vote of another secondary node responsive to determining that the another secondary node is more suitable to be elected as the new primary node based on the election criteria; and
evaluating, by the respective one of the at least one secondary node, the operation log of the at least one arbiter node as part of the acts of determining.
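The consensus step of claim 13 reduces to a majority tally in which the arbiter is a full voting participant. The sketch below assumes a simple strict-majority rule; the quorum details of any real protocol would differ.

def elected(candidate: str, votes: dict, members: list) -> bool:
    """votes maps voter id -> candidate id; members includes the arbiter
    but excludes the failed primary."""
    tally = sum(1 for choice in votes.values() if choice == candidate)
    return tally > len(members) // 2  # strict majority of participating members

members = ["sec-1", "sec-2", "arbiter-1"]
votes = {"sec-1": "sec-1", "sec-2": "sec-1", "arbiter-1": "sec-1"}
print(elected("sec-1", votes, members))  # -> True: sec-1 becomes the new primary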
14. The method according to claim 13, further comprising an act of communicating the self-vote responsive to querying the operation log of the at least one arbiter node and determining that the operation log of the at least one arbiter node is more up to date than operation logs of one or more of the at least one secondary node based on received election information.
15. The method according to claim 13, wherein the election criteria include at least one of, or any combination of, or two or more of: a most recent data requirement, a location requirement, a hardware requirement, and an uptime requirement.
16. The method according to claim 14, wherein responsive to the respective one of the at least one secondary node determining that the operation log of the at least one arbiter node includes the most up to date data, triggering an update of a replica of the data hosted by the respective one of the at least one secondary node.
17. The method according to claim 13, further comprising an act of committing write operations responsive to safe write requests received from database clients, wherein committing the write operations includes counting an act of logging the write operations at the at least one arbiter node towards any commitment requirements.
18. A distributed database system, the system comprising:
at least one processor operatively connected to a memory, wherein the at least one processor when running is configured to execute a plurality of system components, wherein the plurality of system components comprise:
a configuration component configured to:
establish at least one primary node configured to host data;
establish at least one secondary node configured to host a replica of the data of the at least one primary node; and
establish at least one arbiter node, the at least one arbiter node configured to host an operation log of operations executed by the at least one primary node, wherein the at least one arbiter node does not host a replica of the data;
an election component configured to elect a new primary node responsive to detecting a failure of the at least one primary node, wherein the election component is further configured to:
execute a consensus protocol to elect one of the at least one secondary node as the new primary node, wherein the consensus protocol includes participation of the at least one secondary node and the at least one arbiter node;
evaluate election criteria at the at least one secondary node during an election period;
communicate i) a self-vote for a respective one of the at least one secondary node responsive to determining that the respective one of the at least one secondary node meets the election criteria, or ii) a confirmation of a received vote of another secondary node responsive to determining that the another secondary node is more suitable to be elected as the new primary node based on the election criteria; and
evaluate the operation log of the at least one arbiter node as part of the acts of determining.
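Claim 18's election component is triggered by detecting a failed primary. A common way to detect such a failure is a heartbeat timeout, sketched below; the ten-second timeout and function names are assumptions for illustration, not values taken from the patent.

import time

HEARTBEAT_TIMEOUT_S = 10.0  # assumed timeout; not specified by the patent

def primary_failed(last_heartbeat: dict, primary: str, now=None) -> bool:
    """last_heartbeat maps member id -> monotonic time of its last heartbeat."""
    now = time.monotonic() if now is None else now
    return now - last_heartbeat.get(primary, float("-inf")) > HEARTBEAT_TIMEOUT_S

# A monitor would poll this and, on True, hand off to the election component.
print(primary_failed({"primary": 0.0}, "primary", now=30.0))  # -> True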
19. The system of claim 18, further comprising a replication component configured to restrict write operations received from client computer systems to the at least one primary node having a primary role.
20. The system according to claim 18, wherein the election component is distributed through one or more nodes comprising the distributed database system.
21. The system according to claim 20, wherein the election component is distributed at least through the at least one secondary node and the at least one arbiter node.
US15/200,975 | Priority date: 2015-07-02 | Filing date: 2016-07-01 | System and method for augmenting consensus election in a distributed database | Status: Active, adjusted expiration 2038-02-15 | US10713275B2 (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
US15/200,975 (US10713275B2) | 2015-07-02 | 2016-07-01 | System and method for augmenting consensus election in a distributed database

Applications Claiming Priority (2)

Application Number | Priority Date | Filing Date | Title
US201562188097P | 2015-07-02 | 2015-07-02 | —
US15/200,975 (US10713275B2) | 2015-07-02 | 2016-07-01 | System and method for augmenting consensus election in a distributed database

Publications (2)

Publication Number | Publication Date
US20170032007A1 (en) | 2017-02-02
US10713275B2 (en) | 2020-07-14

Family

ID=57882775

Family Applications (2)

Application Number | Title | Priority Date | Filing Date
US15/200,721 (US10496669B2, Active, expires 2038-01-26) | System and method for augmenting consensus election in a distributed database | 2015-07-02 | 2016-07-01
US15/200,975 (US10713275B2, Active, expires 2038-02-15) | System and method for augmenting consensus election in a distributed database | 2015-07-02 | 2016-07-01

Family Applications Before (1)

Application Number | Title | Priority Date | Filing Date
US15/200,721 (US10496669B2, Active, expires 2038-01-26) | System and method for augmenting consensus election in a distributed database | 2015-07-02 | 2016-07-01

Country Status (1)

Country | Link
US (2) | US10496669B2 (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US10846305B2 (en) | 2010-12-23 | 2020-11-24 | MongoDB, Inc. | Large distributed database clustering systems and methods
US10866868B2 (en) | 2017-06-20 | 2020-12-15 | MongoDB, Inc. | Systems and methods for optimization of database operations
US10872095B2 (en) | 2012-07-26 | 2020-12-22 | MongoDB, Inc. | Aggregation framework system architecture and method
US10977277B2 (en) | 2010-12-23 | 2021-04-13 | MongoDB, Inc. | Systems and methods for database zone sharding and API integration
US10990590B2 (en) | 2012-07-26 | 2021-04-27 | MongoDB, Inc. | Aggregation framework system architecture and method
US10997211B2 (en) | 2010-12-23 | 2021-05-04 | MongoDB, Inc. | Systems and methods for database zone sharding and API integration
US11222043B2 (en) | 2010-12-23 | 2022-01-11 | MongoDB, Inc. | System and method for determining consensus within a distributed database
US11288282B2 (en) | 2015-09-25 | 2022-03-29 | MongoDB, Inc. | Distributed database systems and methods with pluggable storage engines
US11394532B2 (en) | 2015-09-25 | 2022-07-19 | MongoDB, Inc. | Systems and methods for hierarchical key management in encrypted distributed databases
US11403317B2 (en) | 2012-07-26 | 2022-08-02 | MongoDB, Inc. | Aggregation framework system architecture and method
US11438249B2 (en) | 2018-10-08 | 2022-09-06 | Alibaba Group Holding Limited | Cluster management method, apparatus and system
US11481289B2 (en) | 2016-05-31 | 2022-10-25 | MongoDB, Inc. | Method and apparatus for reading and writing committed data
US11520670B2 (en) | 2016-06-27 | 2022-12-06 | MongoDB, Inc. | Method and apparatus for restoring data from snapshots
US11544288B2 (en) | 2010-12-23 | 2023-01-03 | MongoDB, Inc. | Systems and methods for managing distributed database deployments
US11544284B2 (en) | 2012-07-26 | 2023-01-03 | MongoDB, Inc. | Aggregation framework system architecture and method
US11615115B2 (en) | 2010-12-23 | 2023-03-28 | MongoDB, Inc. | Systems and methods for managing distributed database deployments

Families Citing this family (35)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US9740762B2 (en) | 2011-04-01 | 2017-08-22 | MongoDB, Inc. | System and method for optimizing data migration in a partitioned database
US9881034B2 (en) | 2015-12-15 | 2018-01-30 | MongoDB, Inc. | Systems and methods for automating management of distributed databases
US8572031B2 (en) | 2010-12-23 | 2013-10-29 | MongoDB, Inc. | Method and apparatus for maintaining replica sets
US10366100B2 (en) | 2012-07-26 | 2019-07-30 | MongoDB, Inc. | Aggregation framework system architecture and method
US10614098B2 (en) | 2010-12-23 | 2020-04-07 | MongoDB, Inc. | System and method for determining consensus within a distributed database
US10713280B2 (en) | 2010-12-23 | 2020-07-14 | MongoDB, Inc. | Systems and methods for managing distributed database deployments
US10740353B2 (en) | 2010-12-23 | 2020-08-11 | MongoDB, Inc. | Systems and methods for managing distributed database deployments
US10496669B2 (en) | 2015-07-02 | 2019-12-03 | MongoDB, Inc. | System and method for augmenting consensus election in a distributed database
US10409807B1 (en)* | 2015-09-23 | 2019-09-10 | Striim, Inc. | Apparatus and method for data replication monitoring with streamed data updates
US10394822B2 (en) | 2015-09-25 | 2019-08-27 | MongoDB, Inc. | Systems and methods for data conversion and comparison
US10423626B2 (en) | 2015-09-25 | 2019-09-24 | MongoDB, Inc. | Systems and methods for data conversion and comparison
US10846411B2 (en) | 2015-09-25 | 2020-11-24 | MongoDB, Inc. | Distributed database systems and methods with encrypted storage engines
US10740733B2 (en) | 2017-05-25 | 2020-08-11 | Oracle International Corporation | Sharded permissioned distributed ledgers
US10481803B2 (en)* | 2017-06-16 | 2019-11-19 | Western Digital Technologies, Inc. | Low write overhead consensus protocol for distributed storage
US10289489B2 (en) | 2017-07-11 | 2019-05-14 | Western Digital Technologies, Inc. | Update efficient consensus protocols for erasure coded data stores
CN109729129B (en)* | 2017-10-31 | 2021-10-26 | 华为技术有限公司 | Configuration modification method of storage cluster system, storage cluster and computer system
US11061924B2 (en)* | 2017-11-22 | 2021-07-13 | Amazon Technologies, Inc. | Multi-region, multi-master replication of database tables
US10664465B2 (en)* | 2018-04-03 | 2020-05-26 | SAP SE | Database change capture with transaction-consistent order
US10838977B2 (en) | 2018-06-22 | 2020-11-17 | Ebay Inc. | Key-value replication with consensus protocol
US11106541B2 (en)* | 2019-03-15 | 2021-08-31 | Huawei Technologies Co., Ltd. | System and method for replicating data in distributed database systems
CN110769028B (en)* | 2019-09-10 | 2022-04-15 | 陕西优米数据技术股份有限公司 | Pattern authorization consensus system and method based on block chain technology
CN110674215A (en)* | 2019-09-17 | 2020-01-10 | 郑州阿帕斯科技有限公司 | Master selection method and device for distributed system and distributed system
US11372818B2 (en) | 2020-01-07 | 2022-06-28 | Samsung Electronics Co., Ltd. | Scaling performance for large scale replica sets for a strongly consistent distributed system
US11727034B2 (en) | 2020-06-08 | 2023-08-15 | MongoDB, Inc. | Cross-cloud deployments
CN112084171B (en)* | 2020-08-14 | 2024-04-12 | 浪潮思科网络科技有限公司 | Operation log writing method, device, equipment and medium based on Cassandra database
US11494356B2 (en) | 2020-09-23 | 2022-11-08 | Salesforce.com, Inc. | Key permission distribution
CN113127565B (en)* | 2021-04-28 | 2025-06-03 | 联通沃音乐文化有限公司 | Method and device for synchronizing distributed database nodes based on external observer group
US11687561B2 (en)* | 2021-08-11 | 2023-06-27 | Capital One Services, LLC | Systems and methods for cross-region data processing
US11438224B1 (en) | 2022-01-14 | 2022-09-06 | Bank of America Corporation | Systems and methods for synchronizing configurations across multiple computing clusters
CN115168322A (en)* | 2022-07-08 | 2022-10-11 | 北京奥星贝斯科技有限公司 | Database system, main library election method and device
US11909404B1 (en)* | 2022-12-12 | 2024-02-20 | Advanced Micro Devices, Inc. | Delay-locked loop offset calibration and correction
CN116016521A (en)* | 2023-01-04 | 2023-04-25 | 中国建设银行股份有限公司 | Network master node election method, device, equipment and storage medium
CN115811520B (en)* | 2023-02-08 | 2023-04-07 | 天翼云科技有限公司 | Election method, device and electronic equipment for master node in distributed system
CN119474203A (en)* | 2024-10-30 | 2025-02-18 | 浪潮金融信息技术有限公司 | A method, system, medium and device for storing logical information of yarn box of self-service equipment
CN119128018B (en)* | 2024-11-08 | 2025-03-11 | 宁波菊风系统软件有限公司 | Master-slave node election method, device, equipment and storage medium in distributed system


US7548928B1 (en)2005-08-052009-06-16Google Inc.Data compression of large scale data stored in sparse tables
US20110225123A1 (en)2005-08-232011-09-15D Souza Roy PMulti-dimensional surrogates for data management
US20070050436A1 (en)2005-08-262007-03-01International Business Machines CorporationOrder-preserving encoding formats of floating-point decimal numbers for efficient value comparison
US20090094318A1 (en)2005-09-302009-04-09Gladwin S ChristopherSmart access to a dispersed data storage network
US7647329B1 (en)2005-12-292010-01-12Amazon Technologies, Inc.Keymap service architecture for a distributed storage system
US8589574B1 (en)2005-12-292013-11-19Amazon Technologies, Inc.Dynamic application instance discovery and state management within a distributed system
US20070203944A1 (en)2006-02-282007-08-30International Business Machines CorporationWeb services database cluster architecture
US20070233746A1 (en)2006-03-302007-10-04Garbow Zachary ATransitioning of Database Service Responsibility Responsive to Server Failure in a Partially Clustered Computing Environment
US20070240129A1 (en)2006-04-062007-10-11Klaus KretzschmarSortable floating point numbers
US7447807B1 (en)2006-06-302008-11-04Siliconsystems, Inc.Systems and methods for storing data in segments of a storage subsystem
US20080002741A1 (en)2006-06-302008-01-03Nokia CorporationApparatus, method and computer program product providing optimized location update for mobile station in a relay-based network
US20080016021A1 (en)2006-07-112008-01-17Dell Products, LpSystem and method of dynamically changing file representations
US20080071755A1 (en)2006-08-312008-03-20Barsness Eric LRe-allocation of resources for query execution in partitions
US20080098041A1 (en)2006-10-202008-04-24Lakshminarayanan ChidambaranServer supporting a consistent client-side cache
US20080288646A1 (en)2006-11-092008-11-20Microsoft CorporationData consistency within a federation infrastructure
US7634459B1 (en)2006-11-162009-12-15Precise Software Solutions Ltd.Apparatus, method and computer-code for detecting changes in database-statement execution paths
US20080140971A1 (en)2006-11-222008-06-12Transitive LimitedMemory consistency protection in a multiprocessor computing system
US8126848B2 (en)2006-12-072012-02-28Robert Edward WagnerAutomated method for identifying and repairing logical data discrepancies between database replicas in a database cluster
US20080162590A1 (en)2007-01-032008-07-03Oracle International CorporationMethod and apparatus for data rollback
US8305999B2 (en)2007-01-052012-11-06Ravi PalankiResource allocation and mapping in a wireless communication system
US8005804B2 (en)2007-03-292011-08-23Redknee Inc.Method and apparatus for adding a database partition
US8352450B1 (en)2007-04-192013-01-08Owl Computing Technologies, Inc.Database update through a one-way data link
US8370857B2 (en)2007-04-202013-02-05Media Logic Corp.Device controller
US20100250930A1 (en)2007-05-092010-09-30Andras CsaszarMethod and apparatus for protecting the routing of data packets
US8086597B2 (en)2007-06-282011-12-27International Business Machines CorporationBetween matching
US20090030986A1 (en)2007-07-272009-01-29Twinstrata, Inc.System and method for remote asynchronous data replication
US20090055350A1 (en)2007-08-242009-02-26International Business Machines CorporationAggregate query optimization
US20090077010A1 (en)2007-09-142009-03-19Brian Robert MurasOptimization of Database Queries Including Grouped Aggregation Functions
US20090222474A1 (en)2008-02-292009-09-03Alpern Bowen LMethod and system for using overlay manifests to encode differences between virtual machine images
US8005868B2 (en)2008-03-072011-08-23International Business Machines CorporationSystem and method for multiple distinct aggregate queries
US20090240744A1 (en)2008-03-212009-09-24Qualcomm IncorporatedPourover journaling
US8082265B2 (en)2008-03-272011-12-20International Business Machines CorporationDatabase query processing with dynamically activated buffers
US8392482B1 (en)2008-03-312013-03-05Amazon Technologies, Inc.Versioning of database partition maps
US20100161492A1 (en)2008-04-142010-06-24Tra, Inc.Analyzing return on investment of advertising campaigns using cross-correlation of multiple data sources
US20090271412A1 (en)2008-04-292009-10-29Maxiscale, Inc.Peer-to-Peer Redundant File Server System and Methods
US8037059B2 (en)2008-05-012011-10-11International Business Machines CorporationImplementing aggregation combination using aggregate depth lists and cube aggregation conversion to rollup aggregation for optimizing query processing
US20100223078A1 (en)2008-06-102010-09-02Dale WillisCustomizable insurance system
US7962458B2 (en)2008-06-122011-06-14Gravic, Inc.Method for replicating explicit locks in a data replication engine
US20100011026A1 (en)2008-07-102010-01-14International Business Machines CorporationMethod and system for dynamically collecting data for checkpoint tuning and reduce recovery time
US20100030793A1 (en)2008-07-312010-02-04Yahoo! Inc.System and method for loading records into a partitioned database table
US20100030800A1 (en)2008-08-012010-02-04International Business Machines CorporationMethod and Apparatus for Generating Partitioning Keys for a Range-Partitioned Database
US20100049717A1 (en)2008-08-202010-02-25Ryan Michael FMethod and systems for sychronization of process control servers
US20100058010A1 (en)2008-09-042010-03-04Oliver AugensteinIncremental backup using snapshot delta views
US8099572B1 (en)2008-09-302012-01-17Emc CorporationEfficient backup and restore of storage objects in a version set
US8363961B1 (en)2008-10-142013-01-29Adobe Systems IncorporatedClustering techniques for large, high-dimensionality data sets
US8869256B2 (en)2008-10-212014-10-21Yahoo! Inc.Network aggregator
US20100106934A1 (en)2008-10-242010-04-29Microsoft CorporationPartition management in a partitioned, scalable, and available structured storage
US20110202792A1 (en)2008-10-272011-08-18Kaminario Technologies Ltd.System and Methods for RAID Writing and Asynchronous Parity Computation
US8615507B2 (en)2008-12-232013-12-24International Business Machines CorporationDatabase management
US20100198791A1 (en)2009-02-052010-08-05Grace Zhanglei WuSystem, method, and computer program product for allowing access to backup data
US20100235606A1 (en)2009-03-112010-09-16Oracle America, Inc.Composite hash and list partitioning of database tables
US8078825B2 (en)2009-03-112011-12-13Oracle America, Inc.Composite hash and list partitioning of database tables
US8321558B1 (en)2009-03-312012-11-27Amazon Technologies, Inc.Dynamically monitoring and modifying distributed execution of programs
US8296419B1 (en)2009-03-312012-10-23Amazon Technologies, Inc.Dynamically modifying a cluster of computing nodes used for distributed execution of a program
US7957284B2 (en)2009-04-012011-06-07Fujitsu LimitedSystem and method for optimizing network bandwidth usage
US20100333111A1 (en)2009-06-292010-12-30Software AgSystems and/or methods for policy-based JMS broker clustering
US20100333116A1 (en)2009-06-302010-12-30Anand PrahladCloud gateway system for managing data storage to cloud storage sites
US20110022642A1 (en)2009-07-242011-01-27Demilo DavidPolicy driven cloud storage management and cloud storage policy router
US9015431B2 (en)2009-10-292015-04-21Cleversafe, Inc.Distributed storage revision rollbacks
US20110125766A1 (en)2009-11-232011-05-26Bank Of America CorporationOptimizing the Efficiency of an Image Retrieval System
US20110125704A1 (en)2009-11-232011-05-26Olga MordvinovaReplica placement strategy for distributed data persistence
US8751533B1 (en)2009-11-252014-06-10Netapp, Inc.Method and system for transparently migrating storage objects between nodes in a clustered storage system
US20110125894A1 (en)2009-11-252011-05-26Novell, Inc.System and method for intelligent workload management
US20110138148A1 (en)2009-12-042011-06-09David FriedmanDynamic Data Storage Repartitioning
US20120320914A1 (en)2010-02-252012-12-20Telefonaktiebolaget Lm Ericsson (Publ) method and arrangement for performing link aggregation
US20120131278A1 (en)2010-03-082012-05-24Jichuan ChangData storage apparatus and methods
US20110225122A1 (en)2010-03-152011-09-15Microsoft CorporationReorganization of data under continuous workload
US20110231447A1 (en)2010-03-182011-09-22Nimbusdb Inc.Database Management System
US20110246717A1 (en)2010-03-312011-10-06Fujitsu LimitedStorage apparatus, recording medium and method for controlling storage apparatus
US9495427B2 (en)2010-06-042016-11-15Yale UniversityProcessing of data using a database system in communication with a data processing framework
US8260840B1 (en)2010-06-282012-09-04Amazon Technologies, Inc.Dynamic scaling of a cluster of computing nodes used for distributed execution of a program
US8539197B1 (en)2010-06-292013-09-17Amazon Technologies, Inc.Load rebalancing for shared resource
US20120054155A1 (en)2010-08-312012-03-01Jeffrey DarcySystems and methods for dynamically replicating data objects within a storage network
US20120078848A1 (en)2010-09-282012-03-29International Business Machines CorporationMethods for dynamic consistency group formation and systems using the same
US20120079224A1 (en)2010-09-292012-03-29International Business Machines CorporationMaintaining mirror and storage system copies of volumes at multiple remote sites
US20120076058A1 (en)2010-09-292012-03-29Infosys Technologies LimitedMethod and system for adaptive aggregation of data in a wireless sensor network
US8103906B1 (en)2010-10-012012-01-24Massoud AlibakhshSystem and method for providing total real-time redundancy for a plurality of client-server systems
US20120084414A1 (en)2010-10-052012-04-05Brock Scott LAutomatic replication of virtual machines
US20120109892A1 (en)2010-10-282012-05-03Microsoft CorporationPartitioning online databases
US20120109935A1 (en)2010-11-022012-05-03Microsoft CorporationObject model to key-value data model mapping
US20120130988A1 (en)2010-11-222012-05-24Ianywhere Solutions, Inc.Highly Adaptable Query Optimizer Search Space Generation Process
US20120136835A1 (en)2010-11-302012-05-31Nokia CorporationMethod and apparatus for rebalancing data
US20120138671A1 (en)2010-12-032012-06-07Echostar Technologies L.L.C.Provision of Alternate Content in Response to QR Code
US20120159097A1 (en)2010-12-162012-06-21International Business Machines CorporationSynchronous extent migration protocol for paired storage
US20120158655A1 (en)2010-12-202012-06-21Microsoft CorporationNon-relational function-based data publication for relational data
US20130290249A1 (en)2010-12-232013-10-31Dwight MerrimanLarge distributed database clustering systems and methods
US20170286518A1 (en)2010-12-232017-10-05Eliot HorowitzSystems and methods for managing distributed database deployments
US8572031B2 (en)2010-12-232013-10-29Mongodb, Inc.Method and apparatus for maintaining replica sets
US20180300385A1 (en)2010-12-232018-10-18Dwight MerrimanSystems and methods for database zone sharding and api integration
US20180314750A1 (en)2010-12-232018-11-01Dwight MerrimanSystems and methods for database zone sharding and api integration
US10346430B2 (en)2010-12-232019-07-09Mongodb, Inc.System and method for determining consensus within a distributed database
US20170262519A1 (en)2010-12-232017-09-14Eliot HorowitzSystem and method for determining consensus within a distributed database
US20170270176A1 (en)2010-12-232017-09-21Eliot HorowitzSystem and method for determining consensus within a distributed database
US20170344618A1 (en)2010-12-232017-11-30Eliot HorowitzSystems and methods for managing distributed database deployments
US20160203202A1 (en)2010-12-232016-07-14Mongodb, Inc.Method and apparatus for maintaining replica sets
US20170286516A1 (en)2010-12-232017-10-05Eliot HorowitzSystems and methods for managing distributed database deployments
US20180096045A1 (en)2010-12-232018-04-05Mongodb, Inc.Large distributed database clustering systems and methods
US20170286517A1 (en)2010-12-232017-10-05Eliot HorowitzSystems and methods for managing distributed database deployments
US9317576B2 (en)2010-12-232016-04-19Mongodb, Inc.Method and apparatus for maintaining replica sets
US9805108B2 (en)2010-12-232017-10-31Mongodb, Inc.Large distributed database clustering systems and methods
US20120166390A1 (en)2010-12-232012-06-28Dwight MerrimanMethod and apparatus for maintaining replica sets
US20140164831A1 (en)2010-12-232014-06-12Mongodb, Inc.Method and apparatus for maintaining replica sets
US20120166517A1 (en)2010-12-282012-06-28Konkuk University Industrial Cooperation CorpIntelligence Virtualization System and Method to support Social Media Cloud Service
US20120198200A1 (en)2011-01-302012-08-02International Business Machines CorporationMethod and apparatus of memory overload control
US20120221540A1 (en)2011-02-242012-08-30A9.Com, Inc.Encoding of variable-length data with group unary formats
US20120254175A1 (en)2011-04-012012-10-04Eliot HorowitzSystem and method for optimizing data migration in a partitioned database
US20170322996A1 (en)2011-04-012017-11-09Mongodb, Inc.System and method for optimizing data migration in a partitioned database
US9740762B2 (en)2011-04-012017-08-22Mongodb, Inc.System and method for optimizing data migration in a partitioned database
US20120274664A1 (en)2011-04-292012-11-01Marc FagnouMobile Device Application for Oilfield Data Visualization
US20130019296A1 (en)2011-07-152013-01-17Brandenburg John CMethods and systems for processing ad server transactions for internet advertising
US20140258343A1 (en)2011-09-222014-09-11Retail Logistics Excellence - Relex OyMechanism for updates in a database engine
US20130151477A1 (en)2011-12-082013-06-13Symantec CorporationSystems and methods for restoring application data
US9268639B2 (en)2011-12-122016-02-23International Business Machines CorporationStoring data in a distributed storage network
US20150012797A1 (en)2011-12-122015-01-08Cleversafe, Inc.Storing data in a distributed storage network
US8589382B2 (en)2011-12-292013-11-19International Business Machines CorporationMulti-fact query processing in data processing system
US9069827B1 (en)2012-01-172015-06-30Amazon Technologies, Inc.System and method for adjusting membership of a data replication group
US8843441B1 (en)2012-01-172014-09-23Amazon Technologies, Inc.System and method for maintaining a master replica for reads and writes in a data store
US9116862B1 (en)2012-01-172015-08-25Amazon Technologies, Inc.System and method for data replication using a single master failover protocol
US20150301901A1 (en)2012-01-172015-10-22Amazon Technologies, Inc.System and method for adjusting membership of a data replication group
US20130290471A1 (en)2012-04-272013-10-31Rajat VenkateshManaging transfer of data from a source to a destination machine cluster
US9183254B1 (en)2012-05-042015-11-10Paraccel LlcOptimizing database queries using subquery composition
US20130332484A1 (en)2012-06-062013-12-12Rackspace Us, Inc.Data Management and Indexing Across a Distributed Database
US20130339379A1 (en)2012-06-132013-12-19Oracle International CorporationInformation retrieval and navigation using a semantic layer and dynamic objects
US20130346366A1 (en)2012-06-222013-12-26Microsoft CorporationFront end and backend replicated storage
US8712044B2 (en)2012-06-292014-04-29Dark Matter Labs Inc.Key management system
US20140013334A1 (en)2012-07-062014-01-09International Business Machines CorporationLog configuration of distributed applications
US9350633B2 (en)2012-07-242016-05-24Lenovo Enterprise Solutions (Singapore) Pte. Ltd.Dynamic optimization of command issuance in a computing cluster
US20140032628A1 (en)2012-07-242014-01-30International Business Machines CorporationDynamic optimization of command issuance in a computing cluster
US20170286510A1 (en)2012-07-262017-10-05Eliot HorowitzAggregation framework system architecture and method
US8996463B2 (en)2012-07-262015-03-31Mongodb, Inc.Aggregation framework system architecture and method
US20170262517A1 (en)2012-07-262017-09-14Eliot HorowitzAggregation framework system architecture and method
US20180300381A1 (en)2012-07-262018-10-18Eliot HorowitzAggregation framework system architecture and method
US20140032525A1 (en)2012-07-262014-01-30Dwight MerrimanAggregation framework system architecture and method
US9792322B2 (en)2012-07-262017-10-17Mongodb, Inc.Aggregation framework system architecture and method
US10031956B2 (en)2012-07-262018-07-24Mongodb, Inc.Aggregation framework system architecture and method
US20140032579A1 (en)2012-07-262014-01-30Dwight MerrimanAggregation framework system architecture and method
US20170262516A1 (en)2012-07-262017-09-14Eliot HorowitzAggregation framework system architecture and method
US9262462B2 (en)2012-07-262016-02-16Mongodb, Inc.Aggregation framework system architecture and method
US20180004804A1 (en)2012-07-262018-01-04Mongodb, Inc.Aggregation framework system architecture and method
US10366100B2 (en)2012-07-262019-07-30Mongodb, Inc.Aggregation framework system architecture and method
US20160246861A1 (en)2012-07-262016-08-25Mongodb, Inc.Aggregation framework system architecture and method
US20150278295A1 (en)2012-07-262015-10-01Mongodb, Inc.Aggregation framework system architecture and method
US20140074790A1 (en)2012-09-122014-03-13International Business Machines CorporationUsing a metadata image of a file system and archive instance to restore data objects in the file system
US20140101100A1 (en)2012-10-052014-04-10Oracle International CorporationProviding services across systems that manage distributed replicas
US20140180723A1 (en)2012-12-212014-06-26The Travelers Indemnity CompanySystems and methods for surface segment data
US20150378786A1 (en)2013-01-312015-12-31Adarsh SuparnaPhysical resource allocation
US20140280380A1 (en)2013-03-132014-09-18Adobe Systems Inc.Method and apparatus for preserving analytics while processing digital content
US20140279929A1 (en)2013-03-152014-09-18Amazon Technologies, Inc.Database system with database engine and separate distributed storage service
US9350681B1 (en)2013-05-312016-05-24Gogrid, LLCSystem and method for distributed management of cloud resources in a hosting environment
US20150016300A1 (en)2013-07-102015-01-15Cisco Technology, Inc.Support for virtual extensible local area network segments across multiple data center sites
US9274902B1 (en)2013-08-072016-03-01Amazon Technologies, Inc.Distributed computing fault management
US20150074041A1 (en)2013-09-062015-03-12International Business Machines CorporationDeferring data record changes using query rewriting
US20150081766A1 (en)2013-09-172015-03-19Verizon Patent And Licensing Inc.Virtual packet analyzer for a cloud computing environment
US9460008B1 (en)2013-09-202016-10-04Amazon Technologies, Inc.Efficient garbage collection for a log-structured data store
US9569481B1 (en)2013-12-102017-02-14Google Inc.Efficient locking of large data collections
US20150242531A1 (en)2014-02-252015-08-27International Business Machines CorporationDatabase access control for multi-tier processing
US20150341212A1 (en)2014-04-152015-11-26Splunk Inc.Visualizations of statistics associated with captured network data
US20150331755A1 (en)2014-05-152015-11-19Carbonite, Inc.Systems and methods for time-based folder restore
US9141814B1 (en)2014-06-032015-09-22Zettaset, Inc.Methods and computer systems with provisions for high availability of cryptographic keys
US20160005423A1 (en)2014-07-022016-01-07Western Digital Technologies, Inc.Data management for a data storage device with zone relocation
US20180300209A1 (en)2014-07-312018-10-18Splunk Inc.High availability scheduler for scheduling map-reduce searches
US20160048345A1 (en)2014-08-122016-02-18Facebook, Inc.Allocation of read/write channels for storage devices
US9959308B1 (en)2014-09-292018-05-01Amazon Technologies, Inc.Non-blocking processing of federated transactions for distributed data partitions
US20160110414A1 (en)2014-10-212016-04-21Samsung Electronics Co., Ltd.Information searching apparatus and control method thereof
US20160110284A1 (en)2014-10-212016-04-21Pranav ATHALYEDistributed cache framework
US20160162374A1 (en)2014-12-032016-06-09Commvault Systems, Inc.Secondary storage editor
US20180165338A1 (en)2014-12-122018-06-14Microsoft Technology Licensing, LlcControlling Service Functions
US9660666B1 (en)2014-12-222017-05-23EMC IP Holding Company LLCContent-aware lossless compression and decompression of floating point data
US20160188377A1 (en)2014-12-312016-06-30Servicenow, Inc.Classification based automated instance management
US20160306709A1 (en)2015-04-162016-10-20Nuodb, Inc.Backup and restore in a distributed database utilizing consistent database snapshots
US20160323378A1 (en)2015-04-282016-11-03Microsoft Technology Licensing, LlcState management in distributed computing systems
US20160364440A1 (en)2015-06-152016-12-15Sap SeEnsuring snapshot monotonicity in asynchronous data replication
US20170032010A1 (en)2015-07-022017-02-02Mongodb, Inc.System and method for augmenting consensus election in a distributed database
US10496669B2 (en)2015-07-022019-12-03Mongodb, Inc.System and method for augmenting consensus election in a distributed database
US10346434B1 (en)2015-08-212019-07-09Amazon Technologies, Inc.Partitioned data materialization in journal-based storage systems
US20170091327A1 (en)2015-09-252017-03-30Mongodb, Inc.Distributed database systems and methods with pluggable storage engines
US20190303382A1 (en)2015-09-252019-10-03Mongodb, Inc.Distributed database systems and methods with pluggable storage engines
US10430433B2 (en)2015-09-252019-10-01Mongodb, Inc.Systems and methods for data conversion and comparison
US20170264432A1 (en)2015-09-252017-09-14Eliot HorowitzSystems and methods for hierarchical key management in encrypted distributed databases
US20170262638A1 (en)2015-09-252017-09-14Eliot HorowitzDistributed database systems and methods with encrypted storage engines
US10394822B2 (en)2015-09-252019-08-27Mongodb, Inc.Systems and methods for data conversion and comparison
US10423626B2 (en)2015-09-252019-09-24Mongodb, Inc.Systems and methods for data conversion and comparison
US20170109398A1 (en)2015-09-252017-04-20Mongodb, Inc.Systems and methods for data conversion and comparison
US10262050B2 (en)2015-09-252019-04-16Mongodb, Inc.Distributed database systems and methods with pluggable storage engines
US20170109399A1 (en)2015-09-252017-04-20Mongodb, Inc.Systems and methods for data conversion and comparison
US20170109421A1 (en)2015-09-252017-04-20Mongodb, Inc.Systems and methods for data conversion and comparison
US20180095852A1 (en)2015-10-222018-04-05Netapp Inc.Implementing automatic switchover
US20170169059A1 (en)2015-12-152017-06-15Mongodb, Inc.Systems and methods for automating management of distributed databases
US10489357B2 (en)2015-12-152019-11-26Mongodb, Inc.Systems and methods for automating management of distributed databases
US20190102410A1 (en)2015-12-152019-04-04Mongodb, Inc.Systems and methods for automating management of distributed databases
US10031931B2 (en)2015-12-152018-07-24Mongodb, Inc.Systems and methods for automating management of distributed databases
US20170322954A1 (en)2015-12-152017-11-09Mongodb, Inc.Systems and methods for automating management of distributed databases
US9881034B2 (en)2015-12-152018-01-30Mongodb, Inc.Systems and methods for automating management of distributed databases
US10372926B1 (en)2015-12-212019-08-06Amazon Technologies, Inc.Passive distribution of encryption keys for distributed data stores
US20170344290A1 (en)2016-05-312017-11-30Eliot HorowitzMethod and apparatus for reading and writing committed data
US20170344441A1 (en)2016-05-312017-11-30Eliot HorowitzMethod and apparatus for reading and writing committed data
US20170371750A1 (en)2016-06-272017-12-28Eliot HorowitzMethod and apparatus for restoring data from snapshots
US20170371968A1 (en)2016-06-272017-12-28Eliot HorowitzSystems and methods for monitoring distributed database deployments
US20180343131A1 (en)2017-05-242018-11-29Cisco Technology, Inc.Accessing composite data structures in tiered storage across network nodes
US20180365114A1 (en)2017-06-202018-12-20Eliot HorowitzSystems and methods for optimization of database operations

Non-Patent Citations (13)

* Cited by examiner, † Cited by third party
Title
[No Author Listed], Automated Administration Tasks (SQL Server Agent). https://docs.microsoft.com/en-us/sql/ssms/agent/automated-adminsitration-tasks-sql-server-agent. 2 pages. [downloaded Mar. 4, 2017].
Chang et al., Bigtable: a distributed storage system for structured data. OSDI'06: Seventh Symposium on Operating System Design and Implementation. Nov. 2006.
Cooper et al., PNUTS: Yahoo!'s hosted data serving platform. VLDB Endowment. Aug. 2008.
Decandia et al., Dynamo: Amazon's highly available key-value store. SOSP 2007. Oct. 2007.
Nelson et al., Automate MongoDB with MMS. PowerPoint Presentation. Published Jul. 24, 2014. 27 slides. http://www.slideshare.net/mongodb/mms-automation-mongo-db-world.
Ongaro et al., In Search of an Understandable Consensus Algorithm. Proceedings of USENIX ATC '14: 2014 USENIX Annual Technical Conference. Philadelphia, PA. Jun. 19-20, 2014; pp. 305-320.
Poder, Oracle living books. 2009. <http://tech.e2sn.com/oracle/sql/oracle-execution-plan-operation-reference>.
Stirman, Run MongoDB with Confidence using MMS. PowerPoint Presentation. Published Oct. 6, 2014. 34 slides. http://www.slideshare.net/mongodb/mongo-db-boston-run-mongodb-with-mms-20141001.
Van Renesse et al., Chain replication for supporting high throughput and availability. OSDI. 2004: 91-104.
Walsh et al., Xproc: An XML Pipeline Language. May 11, 2011. <https://www.w3.org/TR/xproc/>.
Wikipedia, Dataflow programming. Oct. 2011. <http://en.wikipedia.org/wiki/Dataflow_programming>.
Wikipedia, Pipeline (Unix). Sep. 2011. <http://en.wikipedia.org/wiki/Pipeline_(Unix)>.
Wilkins et al., Migrate DB2 applications to a partitioned database. developerWorks, IBM. Apr. 24, 2008, 33 pages.

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US11544288B2 (en) | 2010-12-23 | 2023-01-03 | Mongodb, Inc. | Systems and methods for managing distributed database deployments
US10977277B2 (en) | 2010-12-23 | 2021-04-13 | Mongodb, Inc. | Systems and methods for database zone sharding and API integration
US10997211B2 (en) | 2010-12-23 | 2021-05-04 | Mongodb, Inc. | Systems and methods for database zone sharding and API integration
US11222043B2 (en) | 2010-12-23 | 2022-01-11 | Mongodb, Inc. | System and method for determining consensus within a distributed database
US11615115B2 (en) | 2010-12-23 | 2023-03-28 | Mongodb, Inc. | Systems and methods for managing distributed database deployments
US10846305B2 (en) | 2010-12-23 | 2020-11-24 | Mongodb, Inc. | Large distributed database clustering systems and methods
US12373456B2 (en) | 2012-07-26 | 2025-07-29 | Mongodb, Inc. | Aggregation framework system architecture and method
US10872095B2 (en) | 2012-07-26 | 2020-12-22 | Mongodb, Inc. | Aggregation framework system architecture and method
US10990590B2 (en) | 2012-07-26 | 2021-04-27 | Mongodb, Inc. | Aggregation framework system architecture and method
US11403317B2 (en) | 2012-07-26 | 2022-08-02 | Mongodb, Inc. | Aggregation framework system architecture and method
US11544284B2 (en) | 2012-07-26 | 2023-01-03 | Mongodb, Inc. | Aggregation framework system architecture and method
US11394532B2 (en) | 2015-09-25 | 2022-07-19 | Mongodb, Inc. | Systems and methods for hierarchical key management in encrypted distributed databases
US11288282B2 (en) | 2015-09-25 | 2022-03-29 | Mongodb, Inc. | Distributed database systems and methods with pluggable storage engines
US11537482B2 (en) | 2016-05-31 | 2022-12-27 | Mongodb, Inc. | Method and apparatus for reading and writing committed data
US11481289B2 (en) | 2016-05-31 | 2022-10-25 | Mongodb, Inc. | Method and apparatus for reading and writing committed data
US11520670B2 (en) | 2016-06-27 | 2022-12-06 | Mongodb, Inc. | Method and apparatus for restoring data from snapshots
US11544154B2 (en) | 2016-06-27 | 2023-01-03 | Mongodb, Inc. | Systems and methods for monitoring distributed database deployments
US10866868B2 (en) | 2017-06-20 | 2020-12-15 | Mongodb, Inc. | Systems and methods for optimization of database operations
US11438249B2 (en) | 2018-10-08 | 2022-09-06 | Alibaba Group Holding Limited | Cluster management method, apparatus and system

Also Published As

Publication number | Publication date
US10496669B2 (en) | 2019-12-03
US20170032010A1 (en) | 2017-02-02
US20170032007A1 (en) | 2017-02-02

Similar Documents

Publication | Publication date | Title
US10713275B2 (en) | System and method for augmenting consensus election in a distributed database
US10846305B2 (en) | Large distributed database clustering systems and methods
US12229011B2 (en) | Scalable log-based continuous data protection for distributed databases
CN111124301B (en) | Data consistency storage method and system of object storage device
CN111338766B (en) | Transaction processing method, apparatus, computer equipment and storage medium
US9690679B2 (en) | Transaction commitment and replication in a storage system
US9201742B2 (en) | Method and system of self-managing nodes of a distributed database cluster with a consensus algorithm
US10853182B1 (en) | Scalable log-based secondary indexes for non-relational databases
US11138061B2 (en) | Method and apparatus to neutralize replication error and retain primary and secondary synchronization during synchronous replication
US10621200B2 (en) | Method and apparatus for maintaining replica sets
JP6404907B2 (en) | Efficient read replica
CN103268318B (en) | A kind of distributed key value database system of strong consistency and reading/writing method thereof
US20130110781A1 (en) | Server replication and transaction commitment
US10366106B2 (en) | Quorum-based replication of data records
JP6225262B2 (en) | System and method for supporting partition level journaling to synchronize data in a distributed data grid
US20080077635A1 (en) | Highly Available Clustered Storage Network
JP2023541298A (en) | Transaction processing methods, systems, devices, equipment, and programs
US9984139B1 (en) | Publish session framework for datastore operation records
US11003550B2 (en) | Methods and systems of operating a database management system DBMS in a strong consistency mode
CN109726211B (en) | Distributed time sequence database
CN113905054A (en) | Kudu cluster data synchronization method, device and system based on RDMA
CN116324731A (en) | Access Consistency in High Availability Databases
US10970177B2 (en) | Methods and systems of managing consistency and availability tradeoffs in a real-time operational DBMS
Pankowski | Consistency and availability of Data in replicated NoSQL databases
Ehsan ul Haque | Persistence and Node Failure Recovery in Strongly Consistent Key-Value Datastore

Legal Events

Date | Code | Title | Description

STPP | Information on status: patent application and granting procedure in general
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP | Information on status: patent application and granting procedure in general
Free format text: FINAL REJECTION MAILED

STPP | Information on status: patent application and granting procedure in general
Free format text: RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER

STPP | Information on status: patent application and granting procedure in general
Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STPP | Information on status: patent application and granting procedure in general
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

AS | Assignment
Owner name: MONGODB, INC., NEW YORK
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: MERRIMAN, DWIGHT; HOROWITZ, ELIOT; SCHWERIN, ANDREW MICHALSKI; AND OTHERS; SIGNING DATES FROM 20200203 TO 20200206; REEL/FRAME: 051787/0517

STPP | Information on status: patent application and granting procedure in general
Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STPP | Information on status: patent application and granting procedure in general
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED

STCF | Information on status: patent grant
Free format text: PATENTED CASE

MAFP | Maintenance fee payment
Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY
Year of fee payment: 4

