A distributed data store is a computer network where information is stored on more than one node, often in a replicated fashion.[1] It is usually specifically used to refer to either a distributed database where users store information on a number of nodes, or a computer network in which users store information on a number of peer network nodes.[citation needed]
Distributed databases are usually non-relational databases that enable quick access to data over a large number of nodes. Some distributed databases expose rich query abilities while others are limited to key-value store semantics. Examples of limited distributed databases are Google's Bigtable, which is much more than a distributed file system or a peer-to-peer network,[2] Amazon's Dynamo,[3] and Microsoft Azure Storage.[4]
Because the ability to run arbitrary queries is less important than availability, designers of distributed data stores have increased availability at the expense of consistency. High-speed read/write access comes with reduced consistency, because it is not possible to guarantee both consistency and availability on a partitioned network, as stated by the CAP theorem.
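The trade-off can be illustrated with a minimal sketch. It assumes a hypothetical two-replica store (`ToyStore`, `Replica`, and the `prefer_availability` flag are illustrative names, not part of any real product) and models a partition as a simple boolean. It is not how any particular system implements replication; it only shows that, during a partition, a store must either accept writes and let replicas diverge or reject writes and stay consistent.

```python
class Replica:
    """One node holding a full copy of the data (hypothetical)."""

    def __init__(self):
        self.data = {}


class ToyStore:
    """Two replicas with a simulated network partition between them."""

    def __init__(self, prefer_availability):
        self.replicas = [Replica(), Replica()]
        self.partitioned = False
        self.prefer_availability = prefer_availability

    def write(self, replica_index, key, value):
        if not self.partitioned:
            # Healthy network: every replica receives the write.
            for replica in self.replicas:
                replica.data[key] = value
            return True
        if self.prefer_availability:
            # Partitioned, availability preferred: accept the write
            # locally, so the other replica now holds stale data.
            self.replicas[replica_index].data[key] = value
            return True
        # Partitioned, consistency preferred: refuse the write.
        return False

    def read(self, replica_index, key):
        return self.replicas[replica_index].data.get(key)


available = ToyStore(prefer_availability=True)
available.write(0, "x", 1)
available.partitioned = True
available.write(0, "x", 2)      # accepted, replicas diverge
print(available.read(0, "x"), available.read(1, "x"))

consistent = ToyStore(prefer_availability=False)
consistent.write(0, "x", 1)
consistent.partitioned = True
print(consistent.write(0, "x", 2))  # rejected, replicas stay in sync
```

In the first store both replicas keep answering but disagree on the value of `x`; in the second they agree, but the write is unavailable until the partition heals.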
In peer network data stores, the user can usually reciprocate and allow other users to use their computer as a storage node as well. Information may or may not be accessible to other users depending on the design of the network.
Most peer-to-peer networks do not have distributed data stores, in that the user's data is only available when their node is on the network. However, this distinction is somewhat blurred in a system such as BitTorrent, where it is possible for the originating node to go offline but the content to continue to be served. Still, this is only the case for individual files requested by the redistributors, as contrasted with networks such as Hyphanet, Winny, Share and Perfect Dark, where any node may be storing any part of the files on the network.
Distributed data stores typically use an error detection and correction technique. Some distributed data stores (such as Parchive over NNTP) use forward error correction techniques to recover the original file when parts of that file are damaged or unavailable. Others try again to download that file from a different mirror.
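The idea behind forward error correction can be sketched with a single XOR parity block. This is a deliberately simple stand-in, not the scheme used by Parchive or any other system named here (production tools generally rely on stronger codes such as Reed-Solomon), and the function names `xor_blocks`, `encode`, and `recover` are hypothetical. The sketch assumes equal-length blocks and can repair at most one missing block.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Return the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stored):
    """Rebuild at most one missing block (marked None) and return the data."""
    missing = [i for i, block in enumerate(stored) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one missing block")
    stored = list(stored)
    if missing:
        survivors = [block for block in stored if block is not None]
        # XOR of all surviving blocks equals the missing one.
        stored[missing[0]] = xor_blocks(survivors)
    return stored[:-1]  # drop the parity block


blocks = [b"abcd", b"efgh", b"ijkl"]
stored = encode(blocks)
stored[1] = None                 # simulate a damaged or unavailable block
assert recover(stored) == blocks
```

Because the parity block travels with the data, the receiver can repair the loss without contacting the sender again, which is the difference from the retry-from-another-mirror approach.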
| Product | License | High availability | Notes |
|---|---|---|---|
| Apache Accumulo | AL2 | | |
| Aerospike | AGPL | | |
| Apache Cassandra | AL2 | Yes | formerly used by Facebook |
| Apache Ignite | AL2 | | |
| Bigtable | Proprietary | | used by Google |
| Couchbase | AL2 | | used by LinkedIn, PayPal, and eBay |
| CrateDB | AL2 | Yes | |
| Apache Druid | AL2 | | used by Netflix and Yahoo |
| Dynamo | Proprietary | | used by Amazon |
| etcd | AL2 | Yes | |
| Hazelcast | AL2, Proprietary | | |
| HBase | AL2 | Yes | formerly used by Facebook |
| Hypertable | GPL 2 | | Baidu |
| MongoDB | SSPL | | |
| MySQL NDB Cluster | GPL 2 | Yes | SQL and NoSQL APIs |
| Riak | AL2 | Yes | |
| Redis | BSD License | Yes | |
| ScyllaDB | AGPL | | |
| Voldemort | AL2 | | used by LinkedIn |
Although GFS provides Google with reliable, scalable distributed file storage, it does not provide any facility for structuring the data contained in the files beyond a hierarchical directory structure and meaningful file names. It's well known that more expressive solutions are required for large data sets. Google's terabytes upon terabytes of data that they retrieve from web crawlers, amongst many other sources, need organising, so that client applications can quickly perform lookups and updates at a finer granularity than the file level. [...] The very first thing you need to know about Bigtable is that it isn't a relational database. This should come as no surprise: one persistent theme through all of these large scale distributed data store papers is that RDBMSs are hard to do with good performance. There is no hard, fixed schema in a Bigtable, no referential integrity between tables (so no foreign keys) and therefore little support for optimised joins.
Dynamo: a highly available and scalable distributed data store