Juan Luis Cano Rodríguez
The simplicity of DuckDB

This post is an adaptation of the one I originally published in the Orchest blog. Enjoy!

This post is the first part of our series “SQL on Python”, in which we will explore different Python libraries that help you manipulate and query your data using SQL or a SQL-inspired syntax.

Why SQL, after all?

SQL (the initials of Structured Query Language, also known as ISO/IEC 9075-1:2016) was originally designed in the 70s for managing relational databases, but nowadays it is widely used for analytics workloads as well.

SQL has lots of benefits for analytics, to name a few:

  • It’s easy to pick up: SQL is a domain-specific language rather than a general-purpose one, and as such it has a more limited scope and fewer syntax elements to learn.
  • It’s everywhere: SQL is a family of query languages available in many systems, all of which share some core characteristics. Once you learn a particular SQL dialect (PostgreSQL, SQL Server, Google Standard SQL, others), you can transfer your skills to the other dialects with ease.
  • It’s fast: SQL is a statically typed language, which allows query planning systems to perform sophisticated optimizations. This, along with decades of accumulated knowledge about relational databases, allows SQL implementations to achieve performance that is difficult to beat.

However, if you are used to the Python or R ecosystems (pandas, Polars, data.table, dplyr), you are probably spoiled by how easy it is to download a CSV or Parquet file from somewhere, launch a Python or R process, read it, and start querying and manipulating it.

Comparatively, this bootstrapping process is a bit more tedious with SQL: assuming you have, say, a local PostgreSQL database up and running and a CSV file, you would need to create a table with the appropriate schema, import the data using COPY, and hope that there are no inconsistencies, missing data, or weird date formats. If the file happened to be Parquet, you would need to work a bit more.
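For reference, here is roughly what that bootstrapping looks like with psycopg2 against a local PostgreSQL instance; the table schema, column names, and file path below are made up for illustration:

import psycopg2

# Assumes a local PostgreSQL server and an "analytics" database already exist
conn = psycopg2.connect("dbname=analytics user=postgres")
cur = conn.cursor()

# You have to declare the schema up front...
cur.execute("""
    CREATE TABLE comments (
        id TEXT,
        created_utc TIMESTAMP,
        sentiment DOUBLE PRECISION
    )
""")

# ...and then bulk-load the CSV, hoping the types and date formats line up
with open("comments.csv") as f:
    cur.copy_expert("COPY comments FROM STDIN WITH (FORMAT csv, HEADER true)", f)

conn.commit()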

To try to make the process a bit more lightweight, you could try to convert your CSV or Parquet to SQLite, a widely available, in-process SQL database. However, SQLite was designed with transactional use cases in mind, and therefore might not scale well with some analytical workloads.
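A quick (if not very scalable) way to do that conversion is to go through pandas; the file and table names here are illustrative:

import sqlite3

import pandas as pd

# The whole file has to fit in memory for this approach
df = pd.read_csv("comments.csv")

with sqlite3.connect("comments.db") as db:
    # Writes the rows into a transactional, row-oriented store
    df.to_sql("comments", db, index=False)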

In summary: SQL is appealing, but the boilerplate not so much. What if you could run SQL for your analytics workloads without having to configure a database, just by importing a module in your Python or R process, and make your queries blazing fast? What if, rather than having to choose between Python and SQL, you could use both?

https://twitter.com/anyfactor/status/1551650476651081729

Enter DuckDB

DuckDB is an open source (MIT) high-performance, in-process SQL database for analytics. It is a relatively new project (the first public release was in June 2019), but it has become tremendously popular in a short period of time.


DuckDB popularity is growing (we like this image so much)

DuckDB can read data from different sources:

  • From CSV or Parquet files
  • From pandas DataFrame or Arrow Table objects in the process memory
  • From PostgreSQL tables (by reading the binary data directly!)

Some of the DuckDB operations have out-of-core capabilities (similar to Vaex or the new streaming mode of Polars), which means that it can read data that is larger than RAM!
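As a rough illustration (the memory limit and file path are made up, and exact spilling behaviour depends on the operation and DuckDB version), you can cap DuckDB's working memory and still aggregate over a file larger than that cap:

import duckdb

conn = duckdb.connect()
conn.execute("PRAGMA memory_limit='1GB'")  # cap DuckDB's working memory

# The aggregation streams over the Parquet file instead of loading it whole
conn.execute("""
    SELECT AVG(sentiment)
    FROM 'bigger-than-ram.parquet'
""").fetchall()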

Finally, DuckDB offers some additions on top of standard SQL that make it very pleasant to use, for example friendlier error messages or, behold, trailing commas!
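For instance, the trailing comma after the last selected column below is accepted by DuckDB, while many other dialects would reject it as a syntax error:

import duckdb

# Note the comma after the last (and only) column in the SELECT list
duckdb.query("""
    SELECT
        42 AS answer,
    """).fetchall()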

Trying out DuckDB

For this example, we will use a dataset containing all mentions of climate change on Reddit before September 2022, obtained from Kaggle. Our overall goal is to understand the sentiment of these mentions.

I have published an Orchest pipeline that contains all the necessary files so you can run these code snippets on JupyterLab easily: the first step downloads the data using your Kaggle API key, and the second step performs some exploratory analysis.

First steps with DuckDB

You can install DuckDB with conda/mamba or pip:

mamba install -y "python-duckdb=0.5.1"

# Or, alternatively, with pip
# pip install "duckdb==0.5.1"

The first step to start using DuckDB is creating a connection object. This mimics the Python Database API 2.0, also implemented by other projects like SQLite and psycopg2:

import duckdb

conn = duckdb.connect()

By default, duckdb.connect() will return a connection to an in-memory database, which will be perfectly fine for reading data from external files. In fact, you can run a SQL query directly on the CSV file straight away!

In [3]: conn.execute("""
   ...: SELECT COUNT(*)
   ...: FROM '/data/reddit-climate/the-reddit-climate-change-dataset-comments.csv'
   ...: """).fetchall()
Out[3]: [(4600698,)]

As you can see, the comments CSV file contains 4.6+ million rows. This took about 50 seconds on an Orchest instance though, which is not very impressive for just a COUNT(*) operation. What about converting the CSV to Parquet, as we did in my blog post about Arrow? This time, we can use DuckDB for that:

import os

csv_files = !ls /data/reddit-climate/*.csv
for filename in csv_files:
    print(f"Reading {filename}...")
    destination_file = os.path.splitext(filename)[0] + ".parquet"
    if os.path.isfile(destination_file):
        continue
    conn.execute(f"""
        COPY (SELECT * FROM '{filename}')
        TO '{destination_file}' (FORMAT 'parquet')
        """)

And now, let’s repeat the query on the Parquet file:

In [8]: %%timeit
   ...: conn.execute("""
   ...: SELECT COUNT(*)
   ...: FROM '/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet'
   ...: """).fetchall()
234 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Approximately a 200x speedup over the same operation using CSV! That is a better baseline for running the rest of the queries.

Querying Parquet files with DuckDB

Since you will be referring to the same file several times, it’s a good moment to create a view for it. This will allow you to query the Parquet file without copying all the data to memory:

conn.execute("""
  CREATE VIEW comments AS
  SELECT * FROM '/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet'
""")

Next, let’s find out which subreddits had the most number of comments about climate change:

In [11]: conn.query("""
    ...: SELECT
    ...:   "subreddit.name" AS subreddit_name,
    ...:   COUNT(*) AS num_comments,
    ...: FROM comments
    ...: GROUP BY subreddit_name
    ...: ORDER BY num_comments DESC
    ...: LIMIT 10
    ...: """).fetchall()
Out[11]:
[('politics', 370018),
 ('worldnews', 351195),
 ('askreddit', 259848),
 ('collapse', 94696),
 ('news', 94558),
 ('futurology', 89945),
 ('science', 71453),
 ('environment', 70444),
 ('canada', 66813),
 ('australia', 60239)]

Unsurprisingly, /r/politics, /r/worldnews, and /r/collapse were among the subreddits with the largest number of comments about climate change.

What about the overall sentiment of those comments?

In [12]: conn.query("""
    ...: SELECT
    ...:   AVG(sentiment) AS average_sentiment,
    ...:   STDDEV(sentiment) AS stddev_sentiment,
    ...: FROM comments
    ...: """).fetchall()
Out[12]: [(-0.005827451977706203, 0.6581439484369691)]

In [13]: conn.query("""
    ...: SELECT
    ...:   "subreddit.name" AS subreddit_name,
    ...:   COUNT(*) AS num_comments,
    ...:   AVG(sentiment) AS average_sentiment,
    ...:   STDDEV(sentiment) AS stddev_sentiment,
    ...: FROM comments
    ...: WHERE subreddit_name IN (
    ...:   SELECT "subreddit.name" AS subreddit_name
    ...:   FROM comments
    ...:   GROUP BY subreddit_name
    ...:   ORDER BY COUNT(*) DESC
    ...:   LIMIT 10
    ...: )
    ...: GROUP BY subreddit_name
    ...: ORDER BY num_comments DESC
    ...: """).fetchall()
Out[13]:
[('politics', 370018, -0.018118589649651674, 0.6600297061408),
 ('worldnews', 351195, -0.058001587387908435, 0.6405990095462681),
 ('askreddit', 259848, -0.068637218639235, 0.6089748718101456),
 ('collapse', 94696, -0.1332661626390419, 0.6667106776062662),
 ('news', 94558, -0.09367126059175682, 0.6276134461239258),
 ('futurology', 89945, 0.0018637489115630797, 0.6506820198836241),
 ('science', 71453, 0.04588216852922973, 0.6248484283076333),
 ('environment', 70444, -0.015670189810189843, 0.6467846578160414),
 ('canada', 66813, 0.021118244331091468, 0.6408319443539487),
 ('australia', 60239, -0.021869519296548085, 0.6405803819103508)]

While the overall sentiment is slightly negative (with a large standard deviation), some subreddits like /r/askreddit and /r/collapse exhibited a sentiment more negative than average. Others, like /r/science and /r/canada, were slightly positive.

Most interestingly, all these queries ran in about 2 seconds!

DuckDB also has integration with Jupyter through the ipython-sql extension and the DuckDB SQLAlchemy driver, which allows you to query your data using an even more compact syntax:

In [1]: %load_ext sql

In [2]: %sql duckdb:///:memory:

In [3]: %%sql
   ...: CREATE VIEW comments AS
   ...: SELECT * FROM '/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet'
   ...:
 * duckdb:///:memory:
Done.
Out[3]: []

In [4]: %sql SELECT COUNT(*) FROM comments
 * duckdb:///:memory:
Done.
Out[4]: [(4600698,)]

Interoperability with Python dataframe libraries

Did you notice how we were using conn.execute() all the time? As we said above, this method follows the widely used Python DBAPI 2.0. However, DuckDB can return richer objects by using conn.query() instead:

rel = conn.query("""
  SELECT
    "subreddit.name" AS subreddit_name,
    COUNT(*) AS num_comments,
  FROM comments
  GROUP BY subreddit_name
  ORDER BY num_comments DESC
  LIMIT 10
""")

This method returns an instance of DuckDBPyRelation, which can be pretty printed in Jupyter:

In [5]: type(rel)
Out[5]: duckdb.DuckDBPyRelation

In [6]: rel
Out[6]:
-------------------
-- Relation Tree --
-------------------
Subquery

--------------------
-- Result Columns --
--------------------
- subreddit_name (VARCHAR)
- num_comments (BIGINT)

--------------------
-- Result Preview --
--------------------
subreddit_name  num_comments
VARCHAR         BIGINT
[ Rows: 10]
politics        370018
worldnews       351195
askreddit       259848
collapse        94696
news            94558
futurology      89945
science         71453
environment     70444
canada          66813
australia       60239

Moreover, you can efficiently retrieve the data from this relation and convert it to several Python objects:

  • A dictionary of masked NumPy arrays using .fetchnumpy()
  • A pandas DataFrame using .df() or its aliases (.fetchdf(), .fetch_df())
  • An Arrow Table using .arrow() or .fetch_arrow_table()
  • An Arrow record batch reader using .fetch_record_batch(chunk_size)

Therefore, you can easily convert query results to a pandas DataFrame, and also a Polars one (since you can pass an Arrow table directly):

In [6]: rel.df()  # pandas
Out[6]:
  subreddit_name  num_comments
0       politics        370018
1      worldnews        351195
2      askreddit        259848
3       collapse         94696
4           news         94558
5     futurology         89945
6        science         71453
7    environment         70444
8         canada         66813
9      australia         60239

In [7]: import polars as pl

In [8]: data = rel.arrow()  # Arrow data

In [9]: pl.DataFrame(data)  # Polars
Out[9]:
shape: (10, 2)
┌────────────────┬──────────────┐
│ subreddit_name ┆ num_comments │
│ ---            ┆ ---          │
│ str            ┆ i64          │
╞════════════════╪══════════════╡
│ politics       ┆ 370018       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ worldnews      ┆ 351195       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ askreddit      ┆ 259848       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ collapse       ┆ 94696        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...            ┆ ...          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ science        ┆ 71453        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ environment    ┆ 70444        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ canada         ┆ 66813        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ australia      ┆ 60239        │
└────────────────┴──────────────┘

Note: Result objects returned by conn.execute() also have these methods, but they consume the data after they are called and therefore are not so convenient.

Conversely, you can transfer data from pandas or Arrow to DuckDB. Or, more precisely: you can use DuckDB to query pandas or Arrow objects that live in memory! Moreover, DuckDB can read local variables without having to do anything:

In [13]: df_most_comments = rel.df()

In [14]: df_most_comments.head()  # pandas
Out[14]:
  subreddit_name  num_comments
0       politics        370018
1      worldnews        351195
2      askreddit        259848
3       collapse         94696
4           news         94558

In [15]: conn.execute("""
    ...: SELECT subreddit_name
    ...: FROM df_most_comments  -- Sorcery!
    ...: LIMIT 5
    ...: """).fetchall()
Out[15]: [('politics',), ('worldnews',), ('askreddit',), ('collapse',), ('news',)]

You can also manually register a compatible object with a given name:

In [17]: conn.register("most_comments_arrow", data)
Out[17]: <duckdb.DuckDBPyConnection at 0x7f9be41434f0>

In [18]: conn.execute("""
    ...: SELECT subreddit_name
    ...: FROM most_comments_arrow
    ...: LIMIT 5
    ...: """).fetchall()
Out[18]: [('politics',), ('worldnews',), ('askreddit',), ('collapse',), ('news',)]

Or, using the %sql magic as before:

In [21]: %sql output << SELECT subreddit_name FROM df_most_comments LIMIT 5
 * duckdb:///:memory:
Done.
Returning data to local variable output

In [22]: output.DataFrame()  # pandas
Out[22]:
  subreddit_name
0       politics
1      worldnews
2      askreddit
3       collapse
4           news

In other words: you can transparently go back and forth between DuckDB and your favourite Python dataframe library. Cool!

Other features

Without extending ourselves too much, there are a few extra interesting things about DuckDB you should check out:

  • The relational API: DuckDBPyRelation objects expose composable methods such as .filter() and .order(), so you can build queries without writing any SQL:

In [27]: rel.filter("num_comments > 100000").order("subreddit_name").df()
Out[27]:
  subreddit_name  num_comments
0      askreddit        259848
1       politics        370018
2      worldnews        351195

The documentation for this API is still in progress, but the DuckDB team will likely expand it in the future!

  • The CLI: DuckDB has a command-line client you can use directly from your terminal, without even launching a Python or Jupyter interpreter:

$ ./duckdb -c '
>   SELECT "subreddit.name" AS subreddit_name,
>   COUNT(*) AS num_comments
> FROM "/data/reddit-climate/the-reddit-climate-change-dataset-comments.parquet"
> GROUP BY subreddit_name
> ORDER BY num_comments DESC
> LIMIT 10
> '
┌────────────────┬──────────────┐
│ subreddit_name │ num_comments │
├────────────────┼──────────────┤
│ politics       │ 370018       │
│ worldnews      │ 351195       │
│ askreddit      │ 259848       │
│ collapse       │ 94696        │
│ news           │ 94558        │
│ futurology     │ 89945        │
│ science        │ 71453        │
│ environment    │ 70444        │
│ canada         │ 66813        │
│ australia      │ 60239        │
└────────────────┴──────────────┘





 Should you use DuckDB?

Although DuckDB is a wonderful piece of technology, “there is no silver bullet”, and there are cases in which you might want to use something else. The project homepage itself hints at some of these:

When not to use DuckDB

  • For transactional workloads, you might want to use SQLite, or a more sophisticated transactional database like PostgreSQL. Remember, DuckDB was created for analytics!
  • When several people are reading or writing the same data, using a warehouse might make more sense.

Other than that, as you saw above, if you are looking for a lightweight and fast solution for in-process analytics, and you want to leverage both your general-purpose language of choice (Python, R, others) and SQL, DuckDB might be exactly what you want.

In upcoming articles of this series we will describe some more alternatives you might find interesting. Stay tuned!

Thanks to Alex Monahan and Elliana May for reviewing early drafts of this blog post. All remaining errors are my own.
