Apache DataFusion Python Bindings
This is a Python library that binds to the Apache Arrow in-memory query engine, DataFusion.
DataFusion's Python bindings can be used as a foundation for building new data systems in Python. Here are some examples:
- Dask SQL uses DataFusion's Python bindings for SQL parsing, query planning, and logical plan optimizations, and then transpiles the logical plan to Dask operations for execution.
- DataFusion Ballista is a distributed SQL query engine that extends DataFusion's Python bindings for distributed use cases.
- DataFusion Ray is another distributed query engine that uses DataFusion's Python bindings.
Features include:
- Execute queries using SQL or DataFrames against CSV, Parquet, and JSON data sources.
- Queries are optimized using DataFusion's query optimizer.
- Execute user-defined Python code from SQL (see the sketch after this list).
- Exchange data with Pandas and other DataFrame libraries that support PyArrow.
- Serialize and deserialize query plans in Substrait format.
- Experimental support for transpiling SQL queries to DataFrame calls with Polars, Pandas, and cuDF.
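As a quick illustration of the user-defined function support mentioned above, here is a minimal sketch (not taken from this repository's examples) that registers a scalar Python UDF and calls it from SQL. The table name `t` and column `a` are made up for illustration; see the online documentation for the full UDF API.

```python
import pyarrow as pa

from datafusion import SessionContext, udf

# Scalar UDF: receives a PyArrow array and returns a PyArrow array
def is_null(array: pa.Array) -> pa.Array:
    return array.is_null()

# Wrap the Python function, declaring input types, return type, and volatility
is_null_udf = udf(is_null, [pa.int64()], pa.bool_(), "stable")

ctx = SessionContext()
ctx.register_udf(is_null_udf)

# Register a small in-memory table, then call the UDF from SQL
batch = pa.RecordBatch.from_pydict({"a": [1, None, 3]})
ctx.register_record_batches("t", [[batch]])
ctx.sql("SELECT a, is_null(a) AS a_is_null FROM t").show()
```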
The following example demonstrates running a SQL query against a Parquet file using DataFusion, storing the results in a Pandas DataFrame, and then plotting a chart.
The Parquet file used in this example can be downloaded from the following page:
```python
from datafusion import SessionContext

# Create a DataFusion context
ctx = SessionContext()

# Register table with context
ctx.register_parquet('taxi', 'yellow_tripdata_2021-01.parquet')

# Execute SQL
df = ctx.sql("select passenger_count, count(*) "
             "from taxi "
             "where passenger_count is not null "
             "group by passenger_count "
             "order by passenger_count")

# convert to Pandas
pandas_df = df.to_pandas()

# create a chart
fig = pandas_df.plot(kind="bar", title="Trip Count by Number of Passengers").get_figure()
fig.savefig('chart.png')
```
This produces the following chart:
You can use SessionContext's register_view method to convert a DataFrame into a view and register it with the context.
```python
from datafusion import SessionContext, col, literal

# Create a DataFusion context
ctx = SessionContext()

# Create sample data
data = {"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]}

# Create a DataFrame from the dictionary
df = ctx.from_pydict(data, "my_table")

# Filter the DataFrame (for example, keep rows where a > 2)
df_filtered = df.filter(col("a") > literal(2))

# Register the dataframe as a view with the context
ctx.register_view("view1", df_filtered)

# Now run a SQL query against the registered view
df_view = ctx.sql("SELECT * FROM view1")

# Collect the results
results = df_view.collect()

# Convert results to a list of dictionaries for display
result_dicts = [batch.to_pydict() for batch in results]

print(result_dicts)
```
This will output:
```
[{'a': [3, 4, 5], 'b': [30, 40, 50]}]
```
It is possible to configure both runtime settings (memory and disk) and configuration settings when creating a context.
```python
from datafusion import RuntimeEnvBuilder, SessionConfig, SessionContext

runtime = (
    RuntimeEnvBuilder()
    .with_disk_manager_os()
    .with_fair_spill_pool(10000000)
)
config = (
    SessionConfig()
    .with_create_default_catalog_and_schema(True)
    .with_default_catalog_and_schema("foo", "bar")
    .with_target_partitions(8)
    .with_information_schema(True)
    .with_repartition_joins(False)
    .with_repartition_aggregations(False)
    .with_repartition_windows(False)
    .with_parquet_pruning(False)
    .set("datafusion.execution.parquet.pushdown_filters", "true")
)
ctx = SessionContext(config, runtime)
```
Refer to the API documentation for more information.
Printing the context will show the current configuration settings.
```python
print(ctx)
```
For information about how to extend DataFusion Python, please see the extensions page of the online documentation.
See the examples for more information.
- Query a Parquet file using SQL
- Query a Parquet file using the DataFrame API (a minimal sketch follows this list)
- Run a SQL query and store the results in a Pandas DataFrame
- Run a SQL query with a Python user-defined function (UDF)
- Run a SQL query with a Python user-defined aggregation function (UDAF)
- Query PyArrow Data
- Create dataframe
- Export dataframe
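As a rough sketch of the DataFrame API referenced in the list above (it is not one of the linked example files), the taxi query shown earlier can also be expressed without SQL. The Parquet file name is reused from the SQL example.

```python
from datafusion import SessionContext, col, literal
from datafusion import functions as f

ctx = SessionContext()

# Read the same Parquet file used in the SQL example above
df = ctx.read_parquet("yellow_tripdata_2021-01.parquet")

# Group by passenger_count, count the rows in each group, and sort the result
df = (
    df.aggregate([col("passenger_count")], [f.count(literal(1)).alias("trip_count")])
    .sort(col("passenger_count").sort(ascending=True))
)
df.show()
```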
The latest release can be installed with uv, pip, or conda:

```bash
uv add datafusion
```

```bash
pip install datafusion
# or
python -m pip install datafusion
```

```bash
conda install -c conda-forge datafusion
```
You can verify the installation by running:
```python
>>> import datafusion
>>> datafusion.__version__
'0.6.0'
```
This section describes how to develop datafusion-python and assumes that you have Rust and Cargo installed. We use the workflow recommended by pyo3 and maturin. The maturin tools used in this workflow can be installed either via uv or pip. Both approaches should offer the same experience. It is recommended to use uv since it has significant performance improvements over pip.
Bootstrap (uv):

By default uv will attempt to build the datafusion Python package. For our development we prefer to build manually. This means that when creating your virtual environment using uv sync you need to pass in the additional flag --no-install-package datafusion, and for uv run commands the additional parameter --no-project.
```bash
# fetch this repo
git clone git@github.com:apache/datafusion-python.git
# create the virtual environment
uv sync --dev --no-install-package datafusion
# activate the environment
source .venv/bin/activate
```
Bootstrap (pip):
```bash
# fetch this repo
git clone git@github.com:apache/datafusion-python.git
# prepare development environment (used to build wheel / install in development)
python3 -m venv .venv
# activate the venv
source .venv/bin/activate
# update pip itself if necessary
python -m pip install -U pip
# install dependencies
python -m pip install -r pyproject.toml
```
The tests rely on test data in git submodules.
```bash
git submodule update --init
```
Whenever Rust code changes (your changes or via git pull):
```bash
# make sure you activate the venv using "source .venv/bin/activate" first
maturin develop --uv
python -m pytest
```
Alternatively, if you are using uv you can do the following without needing to activate the virtual environment:
```bash
uv run --no-project maturin develop --uv
uv run --no-project pytest
```
datafusion-python takes advantage of pre-commit to assist developers with code linting to help reduce the number of commits that ultimately fail in CI due to linter errors. Using the pre-commit hooks is optional for the developer but certainly helpful for keeping PRs clean and concise.

Our pre-commit hooks can be installed by running pre-commit install, which will install the configurations in your DATAFUSION_PYTHON_ROOT/.github directory and run each time you perform a commit, failing to complete the commit if an offending lint is found, allowing you to make changes locally before pushing.
The pre-commit hooks can also be run ad hoc, without installing them, by simply running pre-commit run --all-files.
There are scripts in ci/scripts for running Rust and Python linters.
```bash
./ci/scripts/python_lint.sh
./ci/scripts/rust_clippy.sh
./ci/scripts/rust_fmt.sh
./ci/scripts/rust_toml_fmt.sh
```
To change test dependencies, change the pyproject.toml and run:

```bash
uv sync --dev --no-install-package datafusion
```