Introduction¶

DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format. DataFusion originated as part of the Apache Arrow project.

DataFusion offers SQL and DataFrame APIs, excellent performance, built-in support for CSV, Parquet, JSON, and Avro, Python bindings, extensive customization, a great community, and more.

Project Goals¶

DataFusion aims to be the query engine of choice for new, fast data-centric systems such as databases, dataframe libraries, machine learning and streaming applications by leveraging the unique features of Rust and Apache Arrow.

Features¶

Feature-rich SQL support and DataFrame API.
Blazingly fast, vectorized, multithreaded, streaming execution engine.
Native support for Parquet, CSV, JSON, and Avro file formats. Support for custom file formats and non-file datasources via the TableProvider trait.
Many extension points: user defined scalar/aggregate/window functions, DataSources, SQL, other query languages, custom plan and execution nodes, optimizer passes, and more.
Streaming, asynchronous IO directly from popular object stores, including AWS S3, Azure Blob Storage, and Google Cloud Storage (other storage systems are supported via the ObjectStore trait).
Excellent documentation and a welcoming community.
A state-of-the-art query optimizer with expression coercion and simplification, projection and filter pushdown, sort- and distribution-aware optimizations, automatic join reordering, and more.
Permissive Apache 2.0 License, predictable and well understood Apache Software Foundation governance.
Implementation in Rust, a modern system language with development productivity similar to Java or Golang, the performance of C++, and loved by programmers everywhere.
Support for Substrait query plans, to easily pass plans across language and system boundaries.

Use Cases¶

DataFusion can be used without modification as an embedded SQL engine or can be customized and used as a foundation for building new systems.

While most current use cases are “analytic” (throughput-oriented), some components of DataFusion, such as the plan representations, are suitable for “streaming” and “transaction” style systems (low latency).

Here are some example systems built using DataFusion:

Specialized analytical database systems such as HoraeDB and more general Apache Spark-like systems such as Ballista
New query language engines such as prql-query and accelerators such as VegaFusion
Research platforms for new database systems, such as Flock
SQL support for another library, such as dask sql
Streaming data platforms such as Synnada
Tools for reading / sorting / transcoding Parquet, CSV, Avro, and JSON files such as qv
Native Spark runtime replacements such as Blaze

By using DataFusion, projects are freed to focus on their specific features, and avoid reimplementing general (but still necessary) features such as an expression representation, standard optimizations, parallelized streaming execution plans, file format support, etc.

Known Users¶

Here are some active projects using DataFusion:

Arroyo Distributed stream processing engine in Rust
ArkFlow High-performance Rust stream processing engine
Ballista Distributed SQL Query Engine
Blaze The Blaze accelerator for Apache Spark leverages native vectorized execution to accelerate query processing
CnosDB Open Source Distributed Time Series Database
Comet Apache Spark native query execution plugin
Cube Store Cube’s universal semantic layer platform is the next evolution of OLAP technology for AI, BI, spreadsheets, and embedded analytics
Dask SQL Distributed SQL query engine in Python
datafusion-dft Batteries included CLI, TUI, and server implementations for DataFusion.
delta-rs Native Rust implementation of Delta Lake
Exon Analysis toolkit for life-science applications
Feldera Fast query engine for incremental computation
Funnel Data Platform powering Marketing Intelligence applications.
GlareDB Fast SQL database for querying and analyzing distributed data.
GreptimeDB Open Source & Cloud Native Distributed Time Series Database
HoraeDB Distributed Time-Series Database
Iceberg-rust Rust implementation of Apache Iceberg
InfluxDB Time Series Database
Kamu Planet-scale streaming data pipeline
LakeSoul Open source LakeHouse framework with native IO in Rust.
Lance Modern columnar data format for ML
OpenObserve Distributed cloud native observability platform
ParadeDB PostgreSQL for Search & Analytics
Parseable Log storage and observability platform
Polygon.io Stock Market API
qv Quickly view your data
Restate Easily build resilient applications using distributed durable async/await
ROAPI Create full-fledged APIs for slowly moving datasets without writing a single line of code
Sail Unifying stream, batch and AI workloads with Apache Spark compatibility
Seafowl CDN-friendly analytical database
Sleeper Serverless, cloud-native, log-structured merge tree based, scalable key-value store
Spice.ai Building blocks for data-driven AI applications
Synnada Streaming-first framework for data products
VegaFusion Server-side acceleration for the Vega visualization grammar
Telemetry Structured logging made easy

Here are some less active projects that used DataFusion:

Integrations and Extensions¶

There are a number of community projects that extend DataFusion orprovide integrations with other systems, some of which are described below:

Language Bindings¶

Integrations¶

Why DataFusion?¶

High Performance: Leveraging Rust and Arrow’s memory model, DataFusion is very fast.
Easy to Connect: Being part of the Apache Arrow ecosystem (Arrow, Parquet, and Flight), DataFusion works well with the rest of the big data ecosystem.
Easy to Embed: Allowing extension at almost any point in its design, and published regularly as a crate on crates.io, DataFusion can be integrated and tailored for your specific use case.
High Quality: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be, and is, used as the foundation for production systems.

Rust Version Compatibility Policy¶

Rust toolchain releases are tracked at Rust Versions and follow semantic versioning. A Rust toolchain release can be identified by a version string like 1.80.0, or more generally major.minor.patch.

DataFusion supports the last 4 stable Rust minor versions released and any such versions released within the last 4 months.

For example, given the releases 1.78.0, 1.79.0, 1.80.0, 1.80.1 and 1.81.0, DataFusion will support 1.78.0, which is 3 minor versions prior to the most recent minor version, 1.81.
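The arithmetic in this example can be sketched in a few lines of standard-library Rust. This is only an illustration of the "last 4 minor versions" rule: the helper names `parse_version` and `oldest_supported_minor` are hypothetical and not part of DataFusion, and the sketch ignores the 4-month window and the hotfix rule, both of which need release dates that are not given here.

```rust
use std::collections::BTreeSet;

/// Parse a `major.minor.patch` version string into its numeric parts.
/// Returns `None` if the string does not have exactly three numeric fields.
fn parse_version(v: &str) -> Option<(u32, u32, u32)> {
    let mut parts = v.split('.').map(|p| p.parse::<u32>());
    let major = parts.next()?.ok()?;
    let minor = parts.next()?.ok()?;
    let patch = parts.next()?.ok()?;
    if parts.next().is_some() {
        return None;
    }
    Some((major, minor, patch))
}

/// Hypothetical helper: the oldest `(major, minor)` among the `window`
/// most recent distinct minor releases in `releases`.
fn oldest_supported_minor(releases: &[&str], window: usize) -> Option<(u32, u32)> {
    // Collapse patch releases: 1.80.0 and 1.80.1 count as one minor version.
    let minors: BTreeSet<(u32, u32)> = releases
        .iter()
        .filter_map(|v| parse_version(v))
        .map(|(major, minor, _patch)| (major, minor))
        .collect();
    // A BTreeSet iterates in ascending order, so walk backwards from the
    // newest minor version and keep the last one inside the window.
    minors.iter().rev().take(window).last().copied()
}

fn main() {
    let releases = ["1.78.0", "1.79.0", "1.80.0", "1.80.1", "1.81.0"];
    let oldest = oldest_supported_minor(&releases, 4);
    // Matches the example in the text: 1.78 is still supported.
    assert_eq!(oldest, Some((1, 78)));
    if let Some((major, minor)) = oldest {
        println!("oldest supported minor version: {major}.{minor}");
    }
}
```

Deduplicating on `(major, minor)` is what makes 1.80.0 and 1.80.1 count once, so the window covers four distinct minor versions rather than four release strings.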

Note: If a Rust hotfix is released for the current MSRV (minimum supported Rust version), the MSRV will be updated to the specific minor version that includes all applicable hotfixes. This takes precedence over the other policies.

DataFusion enforces the MSRV policy using an MSRV CI check.
