Apache Druid

From Wikipedia, the free encyclopedia

Analytical database software

Apache Druid^[1]

Original author(s)	Metamarkets
Developer(s)	Apache Software Foundation

Stable release	32.0.1^[2] / 19 March 2025

Repository	github.com/apache/druid
Written in	Java
Operating system	Cross-platform
Type	distributed real-time time-series column-oriented data store
License	Apache License 2.0
Website	druid.apache.org

Druid is a column-oriented, open-source, distributed data store written in Java. Druid is designed to quickly ingest massive quantities of event data and to provide low-latency queries on top of that data.^[3] The name Druid comes from the shapeshifting Druid class in many role-playing games, reflecting that the architecture of the system can shift to solve different types of data problems.

Druid is commonly used in business intelligence and OLAP applications to analyze high volumes of real-time and historical data.^[4] Druid is used in production by technology companies such as Alibaba,^[4] Airbnb,^[4] Nielsen,^[4] Cisco,^[5]^[4] eBay,^[6] Lyft,^[7] Netflix,^[8] PayPal,^[4] Pinterest,^[9] Reddit,^[10] Twitter,^[11] Walmart,^[12] the Wikimedia Foundation,^[13] and Yahoo.^[14]

History

Druid was started in 2011 by Eric Tschetter, Fangjin Yang, Gian Merlino and Vadim Ogievetsky^[15] to power the analytics product of Metamarkets. The project was open-sourced under the GPL in October 2012,^[16]^[17]^[18] and was relicensed under the Apache License in February 2015.^[19]^[20]

Architecture

Fully deployed, Druid runs as a cluster of specialized processes (called nodes in Druid) to support a fault-tolerant architecture^[21] in which data is stored redundantly and there is no single point of failure.^[22] The cluster relies on external dependencies for coordination (Apache ZooKeeper), metadata storage (e.g. MySQL, PostgreSQL, or Derby), and a deep storage facility (e.g. HDFS or Amazon S3) for permanent data backup.

Query management

Client queries first hit broker nodes, which forward them to the appropriate data nodes (either historical or real-time). Since Druid segments may be partitioned, an incoming query can require data from multiple segments and partitions (or shards) stored on different nodes in the cluster. Brokers learn which nodes hold the required data, and they merge the partial results before returning the aggregated result.
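
The scatter-and-merge pattern described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the node names, segment identifiers, row data, and the functions `locate`, `node_query`, and `broker_query` are all hypothetical, and none of this is Druid's actual code or API.

```python
from collections import Counter

# Hypothetical cluster state: which data node serves which segments.
# Each segment holds (page, edits) rows; all names are made up for illustration.
NODES = {
    "historical-1": {"2024-01-01_p0": [("Main", 3), ("Druid", 1)]},
    "historical-2": {"2024-01-01_p1": [("Druid", 4)],
                     "2024-01-02_p0": [("Main", 2)]},
    "realtime-1": {"2024-01-03_p0": [("Druid", 5), ("Java", 1)]},
}

def locate(segment_ids):
    """Map each node to the requested segments it serves."""
    plan = {}
    for node, segments in NODES.items():
        wanted = [s for s in segment_ids if s in segments]
        if wanted:
            plan[node] = wanted
    return plan

def node_query(node, segment_ids):
    """Partial aggregate (sum of edits per page) computed on one data node."""
    partial = Counter()
    for seg in segment_ids:
        for page, edits in NODES[node][seg]:
            partial[page] += edits
    return partial

def broker_query(segment_ids):
    """Scatter the query to the data nodes, then merge their partial results."""
    merged = Counter()
    for node, segs in locate(segment_ids).items():
        merged.update(node_query(node, segs))  # Counter.update adds counts
    return dict(merged)

result = broker_query(["2024-01-01_p0", "2024-01-01_p1", "2024-01-03_p0"])
print(result)  # {'Main': 3, 'Druid': 10, 'Java': 1}
```

The query touches three segments spread over three nodes, so the total for "Druid" only becomes correct after the broker step sums the three partial counts.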

Cluster management

Operations relating to data management in historical nodes are overseen by coordinator nodes. Apache ZooKeeper is used to register all nodes, manage certain aspects of internode communications, and provide for leader elections.
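
As a rough illustration of leader election, ZooKeeper's documented election recipe has each candidate create an ephemeral sequential node, with the candidate holding the lowest sequence number acting as leader. The sketch below simulates that rule in memory. It does not talk to ZooKeeper, the class `ToyRegistry` and the coordinator names are hypothetical, and the text above does not say whether Druid uses exactly this recipe.

```python
import itertools

class ToyRegistry:
    """In-memory stand-in for a coordination service (not ZooKeeper itself)."""

    def __init__(self):
        self._seq = itertools.count()
        self._members = {}  # sequence number -> node name

    def register(self, name):
        """Register a live node and return its sequence number."""
        seq = next(self._seq)
        self._members[seq] = name
        return seq

    def unregister(self, seq):
        """Drop a node, as would happen when its session ends."""
        self._members.pop(seq, None)

    def leader(self):
        """The live node with the lowest sequence number leads."""
        if not self._members:
            return None
        return self._members[min(self._members)]

registry = ToyRegistry()
a = registry.register("coordinator-a")
b = registry.register("coordinator-b")
first = registry.leader()   # coordinator-a registered first
registry.unregister(a)      # simulate coordinator-a failing
second = registry.leader()  # leadership passes to coordinator-b
print(first, second)
```

Because leadership is derived from the set of live registrations, a standby takes over as soon as the failed leader's entry disappears.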

Features

Low latency (streaming) data ingestion.
Arbitrary slice and dice data exploration.
Sub-second analytic queries.
Approximate and exact computations.
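
The difference between exact and approximate computation can be sketched with a distinct count. The example below uses a K-minimum-values (KMV) estimator purely for illustration. It is not claimed to be the algorithm Druid uses, and the function names, the choice of `k`, and the sample data are assumptions.

```python
import hashlib
import heapq

def unit_hash(value):
    """Hash a value deterministically to a float in [0, 1)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def exact_distinct(values):
    """Exact count: memory grows with the number of distinct values."""
    return len(set(values))

def approx_distinct(values, k=1024):
    """KMV estimate: keeps only the k smallest hash values seen."""
    heap = []     # negated hashes, so heap[0] is the largest kept hash
    kept = set()  # the same hashes, used to skip duplicates
    for v in values:
        h = unit_hash(v)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values: the count is exact
    # -heap[0] is the k-th smallest hash; (k - 1) / that value estimates n.
    return int((k - 1) / -heap[0])

# 60,000 events from 20,000 hypothetical users.
events = [f"user-{i % 20000}" for i in range(60000)]
exact = exact_distinct(events)
approx = approx_distinct(events)
print(exact, approx)
```

The estimator trades a small relative error (on the order of 1/sqrt(k)) for memory bounded by `k`, regardless of how many distinct values the data contains.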

Performance

In 2019, researchers compared the performance of Hive, Presto, and Druid using a denormalized Star Schema Benchmark based on the TPC-H standard. Druid was tested using both a “Druid Best” configuration, which uses tables with hashed partitions, and a “Druid Suboptimal” configuration, which does not use hashed partitions.^[23]

Tests were conducted by running the 13 TPC-H queries using TPC-H Scale Factor 30 (a 30 GB database), Scale Factor 100 (a 100 GB database), and Scale Factor 300 (a 300 GB database).


Scale Factor	Hive	Presto	Druid Best	Druid Suboptimal
30	256s	33s	2.09s	3.21s
100	424s	90s	6.12s	8.08s
300	982s	452s	7.60s	20.02s

Druid's query times were roughly 98% lower than Hive's and at least 90% lower than Presto's in every scenario, even with the Druid Suboptimal configuration.
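
The percentages can be recomputed directly from the table. The short script below does so, using the table's figures as input; the helper name `reduction` is an illustrative choice. It shows that the smallest reduction relative to Hive is about 97.96% (Scale Factor 300, Druid Suboptimal), which rounds to 98%, and that the smallest reduction relative to Presto is about 90.27% (Scale Factor 30, Druid Suboptimal).

```python
# Query times in seconds, copied from the table above.
TIMES = {
    30: {"Hive": 256, "Presto": 33, "Druid Best": 2.09, "Druid Suboptimal": 3.21},
    100: {"Hive": 424, "Presto": 90, "Druid Best": 6.12, "Druid Suboptimal": 8.08},
    300: {"Hive": 982, "Presto": 452, "Druid Best": 7.60, "Druid Suboptimal": 20.02},
}

def reduction(baseline, druid):
    """Percentage by which Druid's time is lower than the baseline's."""
    return 100 * (1 - druid / baseline)

vs_hive = []
vs_presto = []
for sf, row in TIMES.items():
    for config in ("Druid Best", "Druid Suboptimal"):
        h = reduction(row["Hive"], row[config])
        p = reduction(row["Presto"], row[config])
        vs_hive.append(h)
        vs_presto.append(p)
        print(f"SF{sf} {config}: {h:.2f}% vs Hive, {p:.2f}% vs Presto")

min_vs_hive = min(vs_hive)
min_vs_presto = min(vs_presto)
print(f"Smallest reduction vs Hive: {min_vs_hive:.2f}%")      # 97.96%
print(f"Smallest reduction vs Presto: {min_vs_presto:.2f}%")  # 90.27%
```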

References

1. "Apache Druid at GitHub". github.com. Retrieved 4 May 2021.
2. "Release 32.0.1". 19 March 2025. Retrieved 22 March 2025.
3. Hemsoth, Nicole. "Druid Summons Strength in Real-Time". Datanami, 8 November 2012. Archived from the original on 2013-02-27. Retrieved 2014-02-07.
4. "Druid | Powered by Druid". druid.apache.org. Retrieved 2016-06-29.
5. Butler, Brandon (20 June 2016). "Under the hood of Cisco's Tetration Analytics platform". Archived from the original on 2024-04-26. Retrieved 2016-06-23.
6. "Druid at Pulsar - ebay的专栏 - 博客频道 - CSDN.NET". blog.csdn.net. Retrieved 2016-06-23.
7. Streaming SQL and Druid by Arup Malakar. Retrieved 2020-01-29.
8. "The Netflix Tech Blog: Announcing Suro: Backbone of Netflix's Data Pipeline". techblog.netflix.com. Retrieved 2016-06-23.
9. Pinterest: Powering Ad Analytics with Apache Druid. Retrieved 2020-01-29.
10. "Scaling Reporting at Reddit - Upvoted". www.redditinc.com. 26 February 2021. Retrieved 2022-09-13.
11. "Interactive Analytics at MoPub: Querying Terabytes of Data in Seconds". blog.twitter.com. Retrieved 2020-01-29.
12. Nayak, Amaresh (2018-02-23). "Event Stream Analytics at Walmart with Druid". Medium. Retrieved 2020-01-29.
13. "Conferences - O'Reilly Media".
14. "Complementing Hadoop at Yahoo: Interactive Analytics with Druid". Retrieved 2016-06-23.
15. "Druid: A Real-time Analytical Data Store" (PDF).
16. Tschetter, Eric. "Introducing Druid". druid.apache.org, 24 October 2012. Archived from the original on 2022-02-08. Retrieved 2019-06-12.
17. Higginbotham, Stacey. "Metamarkets open sources Druid, its in-memory database". GigaOM, 24 October 2012. Archived from the original on 2021-09-18. Retrieved 2014-02-07.
18. "Metamarkets Open Sources Druid, Streaming Real-Time Data Store". Yahoo News. 2012-10-24. Retrieved 2023-07-24.
19. Harris, Derrick (2015-02-20). "The Druid real-time database moves to an Apache license". Archived from the original on 2015-08-22. Retrieved 2015-08-04.
20. "Druid Gets Open Source-ier Under the Apache License". Retrieved 2015-08-04.
21. "Druid Project Documentation".
22. Yang, Fangjin; Tschetter, Eric; Léauté, Xavier; Ray, Nelson; Merlino, Gian; Ganguli, Deep. "Druid: A Real-time Analytical Data Store" (PDF). Metamarkets. Retrieved 6 February 2014.
23. Correia, José; Costa, Carlos; Santos, Maribel Yasmina (2019). "Challenging SQL-on-Hadoop Performance with Apache Druid". In Abramowicz, Witold; Corchuelo, Rafael (eds.). Business Information Systems. Lecture Notes in Business Information Processing. Vol. 353. Cham: Springer International Publishing. pp. 149–161. doi:10.1007/978-3-030-20485-3_12. hdl:1822/66785. ISBN 978-3-030-20485-3. S2CID 190005302.

External links

Official website
