Movatterモバイル変換

dask-contrib/dask-sqlPublic

NotificationsYou must be signed in to change notification settings
Fork72
Star406

Distributed SQL Engine in Python using Dask

dask-sql.readthedocs.io/

License

MIT license

406 stars 72 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 650 Commits
.cargo		.cargo
.github		.github
continuous_integration		continuous_integration
dask_sql		dask_sql
docs		docs
notebooks		notebooks
src		src
tests		tests
.coveragerc		.coveragerc
.dockerignore		.dockerignore
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.readthedocs.yaml		.readthedocs.yaml
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE.txt		LICENSE.txt
MANIFEST.in		MANIFEST.in
README.md		README.md
conftest.py		conftest.py
pyproject.toml		pyproject.toml
rustfmt.toml		rustfmt.toml
setup.cfg		setup.cfg

Repository files navigation

Dask-SQL is currently not in active maintenance, see#1344 for more information

dask-sql is a distributed SQL query engine in Python.It allows you to query and transform your data using a mixture ofcommon SQL operations and Python code and also scale up the calculation easilyif you need it.

Combine the power of Python and SQL: load your data with Python, transform it with SQL, enhance it with Python and query it with SQL - or the other way round.Withdask-sql you can mix the well known Python dataframe API ofpandas andDask with common SQL operations, toprocess your data in exactly the way that is easiest for you.
Infinite Scaling: using the power of the greatDask ecosystem, your computations can scale as you need it - from your laptop to your super cluster - without changing any line of SQL code. From k8s to cloud deployments, from batch systems to YARN - ifDasksupports it, so willdask-sql.
Your data - your queries: Use Python user-defined functions (UDFs) in SQL without any performance drawback and extend your SQL queries with the large number of Python libraries, e.g. machine learning, different complicated input formats, complex statistics.
Easy to install and maintain:dask-sql is just a pip/conda install away (or a docker run if you prefer).
Use SQL from wherever you like:dask-sql integrates with your jupyter notebook, your normal Python module or can be used as a standalone SQL server from any BI tool. It even integrates natively withApache Hue.
GPU Support:dask-sql supports running SQL queries on CUDA-enabled GPUs by utilizingRAPIDS libraries likecuDF, enabling accelerated compute for SQL.

Example

For this example, we use some data loaded from disk and query them with a SQL command from our python code.Any pandas or dask dataframe can be used as input anddask-sql understands a large amount of formats (csv, parquet, json,...) and locations (s3, hdfs, gcs,...).

importdask.dataframeasddfromdask_sqlimportContext# Create a context to hold the registered tablesc=Context()# Load the data and register it in the context# This will give the table a name, that we can use in queriesdf=dd.read_csv("...")c.create_table("my_data",df)# Now execute a SQL query. The result is again dask dataframe.result=c.sql("""    SELECT        my_data.name,        SUM(my_data.x)    FROM        my_data    GROUP BY        my_data.name""",return_futures=False)# Show the resultprint(result)

Quickstart

Have a look into thedocumentation or start the example notebook onbinder.

dask-sql is currently under development and does so far not understand all SQL commands (but a large fraction).We are actively looking for feedback, improvements and contributors!

Installation

dask-sql can be installed viaconda (preferred) orpip - or in a development environment.

With`conda`

Create a new conda environment or use your already present environment:

conda create -n dask-sqlconda activate dask-sql

Install the package from theconda-forge channel:

conda install dask-sql -c conda-forge

With`pip`

You can install the package with

pip install dask-sql

For development

If you want to have the newest (unreleased)dask-sql version or if you plan to do development ondask-sql, you can also install the package from sources.

git clone https://github.com/dask-contrib/dask-sql.git

Create a new conda environment and install the development environment:

conda env create -f continuous_integration/environment-3.9.yaml

It is not recommended to usepip instead ofconda for the environment setup.

After that, you can install the package in development mode

pip install -e ".[dev]"

The Rust DataFusion bindings are built as part of thepip install.Note that if changes are made to the Rust source insrc/, another build must be run to recompile the bindings.This repository usespre-commit hooks. To install them, call

pre-commit install

Testing

You can run the tests (after installation) with

pytest tests

GPU-specific tests require additional dependencies specified incontinuous_integration/gpuci/environment.yaml.These can be added to the development environment by running

conda env update -n dask-sql -f continuous_integration/gpuci/environment.yaml

And GPU-specific tests can be run with

pytest tests -m gpu --rungpu

SQL Server

dask-sql comes with a small test implementation for a SQL server.Instead of rebuilding a full ODBC driver, we re-use thepresto wire protocol.It is - so far - only a start of the development and missing important concepts, such asauthentication.

You can test the sql presto server by running (after installation)

dask-sql-server

or by using the created docker image

docker run --rm -it -p 8080:8080 nbraun/dask-sql

in one terminal. This will spin up a server on port 8080 (by default)that looks similar to a normal presto database to any presto client.

You can test this for example with the defaultpresto client:

presto --server localhost:8080

Now you can fire simple SQL queries (as no data is loaded by default):

=> SELECT 1 + 1; EXPR$0--------    2(1 row)

You can find more information in thedocumentation.

CLI

You can also run the CLIdask-sql for testing out SQL commands quickly:

dask-sql --load-test-data --startup(dask-sql) > SELECT * FROM timeseries LIMIT 10;

How does it work?

At the core,dask-sql does two things:

translate the SQL query usingDataFusion into a relational algebra, which is represented as a logical query plan - similar to many other SQL engines (Hive, Flink, ...)
convert this description of the query into dask API calls (and execute them) - returning a dask dataframe.

For the first step, Arrow DataFusion needs to know about the columns and types of the dask dataframes, therefore some Rust code to store this information for dask dataframes are defined indask_planner.After the translation to a relational algebra is done (usingDaskSQLContext.logical_relational_algebra), the python methods defined indask_sql.physical turn this into a physical dask execution plan by converting each piece of the relational algebra one-by-one.

About

Distributed SQL Engine in Python using Dask

dask-sql.readthedocs.io/

Releases45

2024.5.0 Latest

May 28, 2024

+ 44 releases

Contributors28

+ 14 contributors

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

License

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Example

Quickstart

Installation

With`conda`

With`pip`

For development

Testing

SQL Server

CLI

How does it work?

About

Topics

Resources

License

Code of conduct

Uh oh!

Stars

Watchers

Forks

Releases45

Contributors28

Uh oh!

Languages

Movatterモバイル変換

License

dask-contrib/dask-sql

Folders and files

Latest commit

History

Repository files navigation

Example

Quickstart

Installation

Withconda

Withpip

For development

Testing

SQL Server

CLI

How does it work?

About

Topics

Resources

License

Code of conduct

Uh oh!

Stars

Watchers

Forks

Releases45

Contributors28

Uh oh!

Languages

With`conda`

With`pip`