Apache DataFusion Python Bindings
This is a Python library that binds to the Apache Arrow in-memory query engine, DataFusion.
DataFusion's Python bindings can be used as a foundation for building new data systems in Python. Here are some examples:
- Dask SQL uses DataFusion's Python bindings for SQL parsing, query planning, and logical plan optimizations, and then transpiles the logical plan to Dask operations for execution.
- DataFusion Ballista is a distributed SQL query engine that extends DataFusion's Python bindings for distributed use cases.
- DataFusion Ray is another distributed query engine that uses DataFusion's Python bindings.
Features include:
- Execute queries using SQL or DataFrames against CSV, Parquet, and JSON data sources.
- Queries are optimized using DataFusion's query optimizer.
- Execute user-defined Python code from SQL (see the sketch after this list).
- Exchange data with Pandas and other DataFrame libraries that support PyArrow.
- Serialize and deserialize query plans in Substrait format.
- Experimental support for transpiling SQL queries to DataFrame calls with Polars, Pandas, and cuDF.
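As a quick illustration of the user-defined function support mentioned above, here is a minimal sketch (not taken from this repository's examples) that registers a scalar Python UDF and calls it from SQL. The table name `t` and column `a` are made up for illustration; see the online documentation for the full UDF API.

```python
import pyarrow as pa

from datafusion import SessionContext, udf

# Scalar UDF: receives a PyArrow array and returns a PyArrow array
def is_null(array: pa.Array) -> pa.Array:
    return array.is_null()

# Wrap the Python function, declaring input types, return type, and volatility
is_null_udf = udf(is_null, [pa.int64()], pa.bool_(), "stable")

ctx = SessionContext()
ctx.register_udf(is_null_udf)

# Register a small in-memory table, then call the UDF from SQL
batch = pa.RecordBatch.from_pydict({"a": [1, None, 3]})
ctx.register_record_batches("t", [[batch]])
ctx.sql("SELECT a, is_null(a) AS a_is_null FROM t").show()
```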
The following example demonstrates running a SQL query against a Parquet file using DataFusion, storing the results in a Pandas DataFrame, and then plotting a chart.
The Parquet file used in this example can be downloaded from the following page:
```python
from datafusion import SessionContext

# Create a DataFusion context
ctx = SessionContext()

# Register table with context
ctx.register_parquet('taxi', 'yellow_tripdata_2021-01.parquet')

# Execute SQL
df = ctx.sql("select passenger_count, count(*) "
             "from taxi "
             "where passenger_count is not null "
             "group by passenger_count "
             "order by passenger_count")

# convert to Pandas
pandas_df = df.to_pandas()

# create a chart
fig = pandas_df.plot(kind="bar", title="Trip Count by Number of Passengers").get_figure()
fig.savefig('chart.png')
```
This produces the following chart:
You can use SessionContext's register_view method to convert a DataFrame into a view and register it with the context.
```python
from datafusion import SessionContext, col, literal

# Create a DataFusion context
ctx = SessionContext()

# Create sample data
data = {"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]}

# Create a DataFrame from the dictionary
df = ctx.from_pydict(data, "my_table")

# Filter the DataFrame (for example, keep rows where a > 2)
df_filtered = df.filter(col("a") > literal(2))

# Register the dataframe as a view with the context
ctx.register_view("view1", df_filtered)

# Now run a SQL query against the registered view
df_view = ctx.sql("SELECT * FROM view1")

# Collect the results
results = df_view.collect()

# Convert results to a list of dictionaries for display
result_dicts = [batch.to_pydict() for batch in results]

print(result_dicts)
```
This will output:
```
[{'a': [3, 4, 5], 'b': [30, 40, 50]}]
```
It is possible to configure both runtime settings (memory and disk) and configuration settings when creating a context.
```python
from datafusion import RuntimeEnvBuilder, SessionConfig, SessionContext

runtime = (
    RuntimeEnvBuilder()
    .with_disk_manager_os()
    .with_fair_spill_pool(10000000)
)
config = (
    SessionConfig()
    .with_create_default_catalog_and_schema(True)
    .with_default_catalog_and_schema("foo", "bar")
    .with_target_partitions(8)
    .with_information_schema(True)
    .with_repartition_joins(False)
    .with_repartition_aggregations(False)
    .with_repartition_windows(False)
    .with_parquet_pruning(False)
    .set("datafusion.execution.parquet.pushdown_filters", "true")
)
ctx = SessionContext(config, runtime)
```
Refer to the API documentation for more information.
Printing the context will show the current configuration settings.
```python
print(ctx)
```
For information about how to extend DataFusion Python, please see the extensions page of the online documentation.
See the examples for more information.
- Query a Parquet file using SQL
- Query a Parquet file using the DataFrame API (a minimal sketch follows this list)
- Run a SQL query and store the results in a Pandas DataFrame
- Run a SQL query with a Python user-defined function (UDF)
- Run a SQL query with a Python user-defined aggregation function (UDAF)
- Query PyArrow Data
- Create dataframe
- Export dataframe
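As a rough sketch of the DataFrame API referenced in the list above (it is not one of the linked example files), the taxi query shown earlier can also be expressed without SQL. The Parquet file name is reused from the SQL example.

```python
from datafusion import SessionContext, col, literal
from datafusion import functions as f

ctx = SessionContext()

# Read the same Parquet file used in the SQL example above
df = ctx.read_parquet("yellow_tripdata_2021-01.parquet")

# Group by passenger_count, count the rows in each group, and sort the result
df = (
    df.aggregate([col("passenger_count")], [f.count(literal(1)).alias("trip_count")])
    .sort(col("passenger_count").sort(ascending=True))
)
df.show()
```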
The latest release can be installed with uv, pip, or conda:

```bash
uv add datafusion
```

```bash
pip install datafusion
# or
python -m pip install datafusion
```

```bash
conda install -c conda-forge datafusion
```
You can verify the installation by running:
```python
>>> import datafusion
>>> datafusion.__version__
'0.6.0'
```
This section describes how to develop datafusion-python and assumes that you have Rust and Cargo installed. We use the workflow recommended by pyo3 and maturin. The maturin tools used in this workflow can be installed either via uv or pip. Both approaches should offer the same experience. It is recommended to use uv since it has significant performance improvements over pip.
Bootstrap (uv):

By default uv will attempt to build the datafusion Python package. For our development we prefer to build manually. This means that when creating your virtual environment using uv sync you need to pass in the additional flag --no-install-package datafusion, and for uv run commands the additional parameter --no-project.
```bash
# fetch this repo
git clone git@github.com:apache/datafusion-python.git
# create the virtual environment
uv sync --dev --no-install-package datafusion
# activate the environment
source .venv/bin/activate
```
Bootstrap (pip):
```bash
# fetch this repo
git clone git@github.com:apache/datafusion-python.git
# prepare development environment (used to build wheel / install in development)
python3 -m venv .venv
# activate the venv
source .venv/bin/activate
# update pip itself if necessary
python -m pip install -U pip
# install dependencies
python -m pip install -r pyproject.toml
```
The tests rely on test data in git submodules.
```bash
git submodule update --init
```
Whenever Rust code changes (your changes or via git pull):
```bash
# make sure you activate the venv using "source .venv/bin/activate" first
maturin develop --uv
python -m pytest
```
Alternatively, if you are using uv you can do the following without needing to activate the virtual environment:
```bash
uv run --no-project maturin develop --uv
uv run --no-project pytest
```
datafusion-python takes advantage of pre-commit to assist developers with code linting to help reduce the number of commits that ultimately fail in CI due to linter errors. Using the pre-commit hooks is optional for the developer but certainly helpful for keeping PRs clean and concise.

Our pre-commit hooks can be installed by running pre-commit install, which will install the configurations in your DATAFUSION_PYTHON_ROOT/.github directory and run each time you perform a commit, failing to complete the commit if an offending lint is found, allowing you to make changes locally before pushing.
The pre-commit hooks can also be run ad hoc, without installing them, by simply running pre-commit run --all-files.
There are scripts in ci/scripts for running Rust and Python linters.
```bash
./ci/scripts/python_lint.sh
./ci/scripts/rust_clippy.sh
./ci/scripts/rust_fmt.sh
./ci/scripts/rust_toml_fmt.sh
```
To change test dependencies, change the pyproject.toml and run:

```bash
uv sync --dev --no-install-package datafusion
```