Apache DataFusion Python Bindings


This is a Python library that binds to the Apache Arrow in-memory query engine DataFusion.

DataFusion's Python bindings can be used as a foundation for building new data systems in Python. Here are some examples:

  • Dask SQL uses DataFusion's Python bindings for SQL parsing, query planning, and logical plan optimizations, and then transpiles the logical plan to Dask operations for execution.
  • DataFusion Ballista is a distributed SQL query engine that extends DataFusion's Python bindings for distributed use cases.
  • DataFusion Ray is another distributed query engine that uses DataFusion's Python bindings.

Features

  • Execute queries using SQL or DataFrames against CSV, Parquet, and JSON data sources.
  • Queries are optimized using DataFusion's query optimizer.
  • Execute user-defined Python code from SQL (see the sketch after this list).
  • Exchange data with Pandas and other DataFrame libraries that support PyArrow.
  • Serialize and deserialize query plans in Substrait format.
  • Experimental support for transpiling SQL queries to DataFrame calls with Polars, Pandas, and cuDF.
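
To illustrate the user-defined function support, here is a minimal sketch. The table t and the is_null function are hypothetical examples; scalar UDFs in DataFusion Python receive and return PyArrow arrays and are wrapped with udf and registered with register_udf:

import pyarrow as pa
from datafusion import SessionContext, udf

ctx = SessionContext()

# hypothetical sample table "t" with a nullable integer column "a"
ctx.from_pydict({"a": [1, None, 3]}, "t")

# a scalar UDF operates on PyArrow arrays
def is_null(array: pa.Array) -> pa.Array:
    return array.is_null()

is_null_udf = udf(is_null, [pa.int64()], pa.bool_(), "stable")
ctx.register_udf(is_null_udf)

# the UDF can now be called from SQL by name
ctx.sql("SELECT a, is_null(a) FROM t").show()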

Example Usage

The following example demonstrates running a SQL query against a Parquet file using DataFusion, storing the results in a Pandas DataFrame, and then plotting a chart.

The Parquet file used in this example can be downloaded from the New York City Taxi and Limousine Commission trip record data page.

from datafusion import SessionContext

# Create a DataFusion context
ctx = SessionContext()

# Register table with context
ctx.register_parquet('taxi', 'yellow_tripdata_2021-01.parquet')

# Execute SQL
df = ctx.sql("select passenger_count, count(*) "
             "from taxi "
             "where passenger_count is not null "
             "group by passenger_count "
             "order by passenger_count")

# convert to Pandas
pandas_df = df.to_pandas()

# create a chart
fig = pandas_df.plot(kind="bar", title="Trip Count by Number of Passengers").get_figure()
fig.savefig('chart.png')

This produces the following chart:

[Chart: Trip Count by Number of Passengers]

Registering a DataFrame as a View

You can use SessionContext's register_view method to convert a DataFrame into a view and register it with the context.

from datafusion import SessionContext, col, literal

# Create a DataFusion context
ctx = SessionContext()

# Create sample data
data = {"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]}

# Create a DataFrame from the dictionary
df = ctx.from_pydict(data, "my_table")

# Filter the DataFrame (for example, keep rows where a > 2)
df_filtered = df.filter(col("a") > literal(2))

# Register the DataFrame as a view with the context
ctx.register_view("view1", df_filtered)

# Now run a SQL query against the registered view
df_view = ctx.sql("SELECT * FROM view1")

# Collect the results
results = df_view.collect()

# Convert results to a list of dictionaries for display
result_dicts = [batch.to_pydict() for batch in results]

print(result_dicts)

This will output:

[{'a': [3, 4, 5], 'b': [30, 40, 50]}]

Configuration

Both runtime settings (memory and disk) and configuration settings can be supplied when creating a context.

from datafusion import RuntimeEnvBuilder, SessionConfig, SessionContext

runtime = (
    RuntimeEnvBuilder()
    .with_disk_manager_os()
    .with_fair_spill_pool(10000000)
)
config = (
    SessionConfig()
    .with_create_default_catalog_and_schema(True)
    .with_default_catalog_and_schema("foo", "bar")
    .with_target_partitions(8)
    .with_information_schema(True)
    .with_repartition_joins(False)
    .with_repartition_aggregations(False)
    .with_repartition_windows(False)
    .with_parquet_pruning(False)
    .set("datafusion.execution.parquet.pushdown_filters", "true")
)
ctx = SessionContext(config, runtime)

Refer to the API documentation for more information.

Printing the context will show the current configuration settings.

print(ctx)

Extensions

For information about how to extend DataFusion Python, please see the extensions page of the online documentation.

More Examples

See the examples for more information:

  • Executing Queries with DataFusion
  • Running User-Defined Python Code
  • Substrait Support (see the sketch after this list)
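
As a sketch of the Substrait support, the following round-trips a query through Substrait bytes. It assumes the Serde and Consumer classes exposed by the datafusion.substrait module and a small in-memory table t created for illustration:

from datafusion import SessionContext
from datafusion import substrait as ss

ctx = SessionContext()
ctx.from_pydict({"a": [1, 2, 3]}, "t")

# serialize a SQL query to Substrait protobuf bytes
plan_bytes = ss.Serde.serialize_bytes("SELECT a FROM t", ctx)

# deserialize the bytes back into a Substrait plan ...
substrait_plan = ss.Serde.deserialize_bytes(plan_bytes)

# ... and convert it into a DataFusion logical plan
logical_plan = ss.Consumer.from_substrait_plan(ctx, substrait_plan)
print(logical_plan)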

How to install

uv

uv add datafusion

Pip

pip install datafusion
# or
python -m pip install datafusion

Conda

conda install -c conda-forge datafusion

You can verify the installation by running:

>>> import datafusion
>>> datafusion.__version__
'0.6.0'

How to develop

This assumes that you have Rust and Cargo installed. We use the workflow recommended by pyo3 and maturin. The Maturin tools used in this workflow can be installed via either uv or pip. Both approaches should offer the same experience; uv is recommended since it is significantly faster than pip.

Bootstrap (uv):

By default, uv will attempt to build the datafusion Python package. For development we prefer to build manually, which means that when creating your virtual environment with uv sync you need to pass the additional flag --no-install-package datafusion, and uv run commands need the additional parameter --no-project.

# fetch this repo
git clone git@github.com:apache/datafusion-python.git

# create the virtual environment
uv sync --dev --no-install-package datafusion

# activate the environment
source .venv/bin/activate

Bootstrap (pip):

# fetch this repo
git clone git@github.com:apache/datafusion-python.git

# prepare development environment (used to build wheel / install in development)
python3 -m venv .venv

# activate the venv
source .venv/bin/activate

# update pip itself if necessary
python -m pip install -U pip

# install dependencies
python -m pip install -r pyproject.toml

The tests rely on test data in git submodules.

git submodule update --init

Whenever Rust code changes (your changes or via git pull):

# make sure you activate the venv using "source .venv/bin/activate" first
maturin develop --uv
python -m pytest

Alternatively, if you are using uv you can do the following without needing to activate the virtual environment:

uv run --no-project maturin develop --uv
uv run --no-project pytest

Running & Installing pre-commit hooks

datafusion-python takes advantage of pre-commit to assist developers with code linting, reducing the number of commits that ultimately fail in CI due to linter errors. Using the pre-commit hooks is optional for the developer but certainly helpful for keeping PRs clean and concise.

Our pre-commit hooks can be installed by running pre-commit install, which installs the configurations in your DATAFUSION_PYTHON_ROOT/.github directory and runs each time you perform a commit, failing the commit if an offending lint is found and allowing you to make changes locally before pushing.

The pre-commit hooks can also be run ad hoc, without installing them, by simply running pre-commit run --all-files

Running linters without using pre-commit

There are scripts in ci/scripts for running Rust and Python linters.

./ci/scripts/python_lint.sh
./ci/scripts/rust_clippy.sh
./ci/scripts/rust_fmt.sh
./ci/scripts/rust_toml_fmt.sh

How to update dependencies

To change test dependencies, edit pyproject.toml and run:

uv sync --dev --no-install-package datafusion
