
pg_duckdb beta release: Even faster analytics in Postgres

2024/10/23 - 12 min read

Editor's note: this tutorial was originally published 2024-10-23 and has been updated on 2025-02-14 to reflect advancements in pg_duckdb.

In August, we announced the pg_duckdb extension, a collaborative open-source project with Hydra, DuckDB Labs, and MotherDuck. pg_duckdb is a PostgreSQL extension that integrates DuckDB's analytics engine directly into PostgreSQL, allowing for rapid analytical queries alongside traditional transactional workloads.

Two months later, we are happy to share a beta release of the extension, which includes exciting features like using the DuckDB engine to query PostgreSQL data, querying data in object storage, and much more.

The best way to do analytics in PostgreSQL is to use your favorite Duck database under the hood.

The easiest way to get started is to use the Docker image provided, which includes PostgreSQL with the latest build of the pg_duckdb extension pre-installed.

If you want to install the extension on your own PostgreSQL instance, see the repository's README for instructions.

Let's first start the container, which will also start a PostgreSQL server:


docker run -d --name pg_duckdb -e POSTGRES_HOST_AUTH_METHOD=trust pgduckdb/pgduckdb:17-v0.3.1

When initializing PostgreSQL, a superuser password must be set. For the sake of demonstration here, we’ve allowed all connections without a password using POSTGRES_HOST_AUTH_METHOD. This is not recommended for production usage.

Now you can connect to PostgreSQL using the psql command line client:


docker exec -it pg_duckdb psql

If you want to see this in action, check out the video we made.

Separation of concerns

PostgreSQL is a transactional database, not an analytical one. It is well-suited for lookups, small updates, and running queries when you have carefully set up your indexes and join relationships. It isn’t, however, great when you want to run ad-hoc analytical queries across the full dataset.

PostgreSQL is often used for analytics, even though it's not specifically designed for that purpose. This is because the data is readily available, making it easy to start. However, as the data volume grows and more complex analytical queries involving aggregation and grouping are needed, users often encounter limitations. This is where an analytical database engine like DuckDB comes to the rescue.

With pg_duckdb, you can use the DuckDB execution engine within PostgreSQL to work with data already stored there, and for some queries, this can result in a dramatic performance improvement. Below is an example query that shows dramatic improvement; however, this obviously does not apply to all queries, and some may actually perform slower when executed in DuckDB.

Let’s try the first query of the TPC-DS benchmark suite, which is included in the TPC-DS DuckDB extension. Using that extension, we created a small script to load the TPC-DS dataset without indexes into PostgreSQL. On a recent Lenovo laptop, this results in the following timings for that first query when using scale factor 1 (aka 1GB of total data):


$ ./load-tpcds.sh 1
$ psql "options=--search-path=tpcds1" -o /dev/null
psql (17.0)
Type "help" for help.
postgres=# \timing on
Timing is on.
postgres=# \i 01.sql -- I ran this twice to warm the cache
Time: 81783.057 ms (01:21.783)

We ingested the TPC-DS datasets into PostgreSQL without indexes for two main reasons:

  1. Currently, pg_duckdb does not support indexes, which makes a direct comparison impossible. Addressing this limitation is a high priority for us.
    EDIT: Since pg_duckdb 0.3, indexes are supported.
  2. While indexes are common in real-world PostgreSQL scenarios, optimizing them for specific analytic queries can be complicated and bring extra overhead. Considering this, we believe there is value in looking at the performance of queries without any indexes.

Running this query on standard PostgreSQL took 81.8 seconds. That’s pretty slow. Now let’s give it a try with pg_duckdb. We can force it to run using the DuckDB query engine by running SET duckdb.force_execution = true;.


postgres=# SET duckdb.force_execution = true; -- causes execution to use DuckDB
Time: 0.287 ms
postgres=# \i 01.sql
Time: 52.190 ms

Executing this specific query using the DuckDB engine, while the data is stored in PostgreSQL, takes only 52 ms, which is more than 1500x faster than running it in the native engine!

The performance improvement holds even when you scale up to larger data sizes and a production machine. If we run this on EC2 in AWS[1], using 10x the data (TPC-DS scale factor 10 instead of 1), this query takes more than 2 hours with the native PostgreSQL execution engine, while it only takes ~400ms when using pg_duckdb.

This huge performance boost is achieved without any need to change how your data is stored or updated. Everything is still stored in the regular PostgreSQL tables that you're already used to.

However, we can do even better if we store the data in a format that is better for analytics. PostgreSQL stores data in row-oriented format, which is ideal for transactional workloads but can make it harder to do queries that need to scan full columns or do aggregations. By storing the data in columnar format you can get even better performance. The sections below outline how you can use Parquet files and MotherDuck to achieve this in pg_duckdb.

Using pg_duckdb with your Data Lake or Lakehouse

DuckDB has native support for reading and writing files on external object stores like AWS S3, so it can be ideal for querying data against your Data Lake. DuckDB can also read Iceberg and Delta tables, so you can also take advantage of a Lakehouse approach. The following snippets use datasets from a public bucket, so feel free to try them out yourself!
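Because the bucket is public, no credentials are needed for the examples below. For your own private buckets you will need to register S3 credentials first. As a hedged sketch: at the time of writing, the pg_duckdb README documents a duckdb.secrets table for this purpose, but treat the exact column names below as an assumption and verify against the README for your pg_duckdb version.

-- Assumption: column names follow the pg_duckdb README at the time of
-- writing; check the README for your version before relying on this.
INSERT INTO duckdb.secrets (type, key_id, secret, region)
VALUES ('S3', 'access_key_id', 'secret_access_key', 'us-east-1');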

Reading a Parquet file

The following query uses pg_duckdb to query Parquet files stored in S3 to find the top TV shows in the US during 2020-2022.


SELECT r['Title'], max(r['Days In Top 10']) AS MaxDaysInTop10
FROM read_parquet('s3://us-prd-motherduck-open-datasets/netflix/netflix_daily_top_10.parquet') r
WHERE r['Type'] = 'TV Show'
GROUP BY r['Title']
ORDER BY MaxDaysInTop10 DESC
LIMIT 5;


             Title              | MaxDaysInTop10
--------------------------------+----------------
 Cocomelon                      |             99
 Tiger King                     |             44
 Jurassic World Camp Cretaceous |             31
 Tiger King: Murder, Mayhem …   |              9
 Ozark                          |              9
(5 rows)

Reading an Iceberg table

In order to query against data in Iceberg, you first need to install the DuckDB Iceberg extension. In pg_duckdb, installing DuckDB extensions is done using the duckdb.install_extension(<extension name>) function.


-- Install the iceberg extension
SELECT duckdb.install_extension('iceberg');

-- Total quantity of items ordered for each l_shipmode
SELECT r['l_shipmode'], SUM(r['l_quantity']) AS total_quantity
FROM iceberg_scan('s3://us-prd-motherduck-open-datasets/iceberg/lineitem_iceberg', allow_moved_paths := true) r
GROUP BY r['l_shipmode']
ORDER BY total_quantity DESC;


 l_shipmode | total_quantity
------------+----------------
 TRUCK      |         219078
 MAIL       |         216395
 FOB        |         214219
 REG AIR    |         214010
 SHIP       |         213141
 RAIL       |         212903
 AIR        |         211154
(7 rows)
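Delta Lake works along the same lines, using DuckDB's delta extension and its delta_scan function. Here is a minimal sketch; the s3:// path below is a hypothetical placeholder rather than a public dataset:

-- Sketch: read a Delta Lake table. The table path is a placeholder.
SELECT duckdb.install_extension('delta');

SELECT COUNT(*)
FROM delta_scan('s3://my-bucket/my_delta_table');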

Writing back to your Data Lake

Access to Data Lakes is not just read-only in pg_duckdb; you can also write back using the COPY command. Note that you can mix and match native PostgreSQL data, so you can use this to export from your PostgreSQL tables to external Data Lake storage.


COPY (
    SELECT r['Title'], max(r['Days In Top 10']) AS MaxDaysInTop10
    FROM read_parquet('s3://us-prd-motherduck-open-datasets/netflix/netflix_daily_top_10.parquet') r
    WHERE r['Type'] = 'TV Show'
    GROUP BY r['Title']
    ORDER BY MaxDaysInTop10 DESC
    LIMIT 5
) TO 's3://my-bucket/results.parquet';

This opens up many possibilities for performing the following operations directly in PostgreSQL:

  • Query existing data from a Data Lake
  • Back up specific PostgreSQL tables to an object store
  • Import data from the Data Lake to support operational applications (see the sketch below)
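As a minimal sketch of that last import direction, a CREATE TABLE ... AS over read_parquet materializes Data Lake data as a regular PostgreSQL heap table. The target table name netflix_top10 is hypothetical:

-- Sketch: materialize Parquet data from S3 as a plain PostgreSQL table.
-- The table name netflix_top10 is a placeholder.
CREATE TABLE netflix_top10 AS
SELECT r['Title'] AS title, r['Days In Top 10'] AS days_in_top10
FROM read_parquet('s3://us-prd-motherduck-open-datasets/netflix/netflix_daily_top_10.parquet') r;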

Scaling further with MotherDuck

Analytical queries typically require a lot more hardware than transactional ones. So a PostgreSQL instance that is perfectly fine for handling high numbers of transactions per second may be severely underpowered if you start running analytics.

MotherDuck can help here by letting you leverage its storage and cloud compute resources, giving you great analytical performance without impacting your production PostgreSQL instance.

With pg_duckdb, you can leverage MotherDuck to push your analytical workload to the cloud, all without leaving PostgreSQL.

In addition to a generous free tier, MotherDuck has a free trial where you can get started for 30 days without a credit card. To get started, you can sign up for MotherDuck here. Next, you'll need to generate and retrieve an access token for authentication.

The only thing you need to do to make pg_duckdb work with MotherDuck is to set your MotherDuck token in the postgresql.conf config file, using the duckdb.motherduck_token parameter. To add it directly to your running pg_duckdb container, you can run:


docker exec -it pg_duckdb sh -c 'echo "duckdb.motherduck_token = '\''<YOUR_MOTHERDUCK_TOKEN>'\''" >> /var/lib/postgresql/data/postgresql.conf'

After that, you will need to restart the container and relaunch a psql session:


docker restart pg_duckdb
docker exec -it pg_duckdb psql

If it is more convenient, you can also store the token as an environment variable and add duckdb.motherduck_enabled = true to your postgresql.conf. Additional details are available in the README.

Now, within PostgreSQL, you can start querying MotherDuck databases or shares. The query below uses a sample_data share database accessible to all MotherDuck users.


-- Number of mentions of duckdb on Hacker News in 2022
SELECT
    EXTRACT(YEAR FROM timestamp) AS year,
    EXTRACT(MONTH FROM timestamp) AS month,
    COUNT(*) AS keyword_mentions
FROM ddb$sample_data$hn.hacker_news
WHERE (title LIKE '%duckdb%' OR text LIKE '%duckdb%')
GROUP BY year, month
ORDER BY year ASC, month ASC;


 year | month | keyword_mentions
------+-------+------------------
 2022 |     1 |                6
 2022 |     2 |                4
 2022 |     3 |               10
 2022 |     4 |                9
 2022 |     5 |               43
 2022 |     6 |                8
 2022 |     7 |               15
 2022 |     8 |                6
 2022 |     9 |               19
 2022 |    10 |               10
 2022 |    11 |                9

You can join your data in MotherDuck with your live data in PostgreSQL, and you can also easily copy data from one to the other.
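For example, here is a hedged sketch of such a cross-database join; the local PostgreSQL table tracked_keywords is hypothetical:

-- Sketch: join a local PostgreSQL table against the MotherDuck-hosted
-- sample data. The tracked_keywords table is a placeholder.
SELECT k.keyword, COUNT(*) AS mentions
FROM tracked_keywords k
JOIN ddb$sample_data$hn.hacker_news hn
  ON hn.title LIKE '%' || k.keyword || '%'
GROUP BY k.keyword
ORDER BY mentions DESC;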

For instance, if you create a table using the USING duckdb clause, it will be created in MotherDuck; otherwise, it will be created in PostgreSQL.

Let’s take the same query as above using MotherDuck, but now create a PostgreSQL table:


CREATE TABLE hacker_news_duckdb_postgres AS
SELECT
    EXTRACT(YEAR FROM timestamp) AS year,
    EXTRACT(MONTH FROM timestamp) AS month,
    COUNT(*) AS keyword_mentions
FROM ddb$sample_data$hn.hacker_news
WHERE (title LIKE '%duckdb%' OR text LIKE '%duckdb%')
GROUP BY year, month
ORDER BY year ASC, month ASC;

If we display the existing tables in PostgreSQL, we’ll see this one stored as a PostgreSQL table (its access method is heap).


postgres=# \d+
                                        List of relations
 Schema |            Name             | Type  |  Owner   | Persistence | Access method |    Size    | Description
--------+-----------------------------+-------+----------+-------------+---------------+------------+-------------
 public | hacker_news_duckdb_postgres | table | postgres | permanent   | heap          | 8192 bytes |

Now, we can also copy this PostgreSQL table to MotherDuck using:


CREATE TABLE hacker_news_duckdb_motherduck USING duckdb AS
SELECT * FROM hacker_news_duckdb_postgres;

The power of the duck in the elephant's hand

While pg_duckdb is still in beta, we are excited about what comes next. You can check out the milestone for the next release to see what’s already on our radar. We still need to trim it based on priorities, though, so if you have certain requests that you think are important, please let us know so they have a higher chance of being part of the next release.

DuckDB's success is all about simplicity, and we are bringing it directly to PostgreSQL users in their existing database.

Check the extension repository for more information, and start playing with your PostgreSQL and MotherDuck account!

[1] OS: Ubuntu 24.04, PostgreSQL version: 17.0 (from https://www.postgresql.org/download/linux/ubuntu/), Instance Type: c7a.4xlarge, vCPUs: 16, RAM: 32GB, Disk type: EBS gp3, Disk size: 500 GiB, Disk IOPS: 6000, Disk Throughput: 250MiB/s. PostgreSQL config: shared_buffers = 12GB (scale factor 10 fits fully in memory), work_mem = 4GB, duckdb.max_memory = 4GB.




[8]ページ先頭

©2009-2025 Movatter.jp