- Notifications
You must be signed in to change notification settings - Fork174
Columnar storage extension for Postgres built as a foreign data wrapper. Check outhttps://github.com/citusdata/citus for a modernized columnar storage implementation built as a table access method.
License
citusdata/cstore_fdw
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Columnar storage is now part of theCitus extension, which uses the table access method API to give a much more native experience. It also supports streaming replication, archival, rollback, and simplifies pg_upgrade. You can use Citus just for columnar storage on a single PostgreSQL server, or combine it with distributed tables to parallelize queries locally or across a cluster of PostgreSQL servers.
Migration is simple, and you'll typically see improved compression thanks tozstd:
-- After adding adding shared_preload_libraries = 'citus'-- to postgresql.conf and restarting:CREATE EXTENSION IF NOT EXISTS citus;-- Create a table using the columnar access method, with the same columns-- as an existing cstore_fdw tableCREATETABLEcustomer_reviews_am (LIKE customer_reviews_fdw INCLUDING ALL) USING columnar;-- Copy data from an old cstore_fdw table to an access method tableINSERT INTO customer_reviews_amSELECT*FROM customer_reviews_fdw;-- cstore_fdw data sizeSELECT pg_size_pretty(cstore_table_size('customer_reviews_fdw'));┌────────────────┐│ pg_size_pretty │├────────────────┤│100 MB │└────────────────┘-- Citus Columnar data sizeSELECT pg_size_pretty(pg_table_size('customer_reviews_am'));┌────────────────┐│ pg_size_pretty │├────────────────┤│64 MB │└────────────────┘
Read more about it in theCitus columnar blog post by Jeff Davis.
Cstore_fdw is an open source columnar store extension for PostgreSQL. Columnar stores provide notable benefits for analytics use cases where data is loaded in batches. Cstore_fdw’s columnar nature delivers performance by only reading relevant data from disk, and it may compress data 6x-10x to reduce space requirements for data archival.
Cstore_fdw is developed byCitus Data and can be used in combination withCitus, a postgres extension that intelligently distributes your data and queries across many nodes so your database can scale and your queries are fast. If you have any questions about how Citus can help you scale or how to use Citus in combination with cstore_fdw,please let us know.
Join theMailing List to stay on top of the latest developments for Cstore_fdw.
This extension uses a format for its data layout that is inspired by ORC,the Optimized Row Columnar format. Like ORC, the cstore format improvesupon RCFile developed at Facebook, and brings the following benefits:
- Compression: Reduces in-memory and on-disk data size by 2-4x. Can be extendedto support different codecs.
- Column projections: Only reads column data relevant to the query. Improvesperformance for I/O bound queries.
- Skip indexes: Stores min/max statistics for row groups, and uses them to skipover unrelated rows.
Further, we used the Postgres foreign data wrapper APIs and type representationswith this extension. This brings:
- Support for 40+ Postgres data types. The user can also create new types anduse them.
- Statistics collection. PostgreSQL's query optimizer uses these stats toevaluate different query plans and pick the best one.
- Simple setup. Create foreign table and copy data. Run SQL.
cstore_fdw depends on protobuf-c for serializing and deserializing table metadata.So we need to install these packages first:
# Fedora 17+, CentOS, and Amazon Linuxsudo yum install protobuf-c-devel# Ubuntu 10.4+sudo apt-get install protobuf-c-compilersudo apt-get install libprotobuf-c0-dev# Ubuntu 18.4+sudo apt-get install protobuf-c-compilersudo apt-get install libprotobuf-c-dev# Mac OS Xbrew install protobuf-c
Note. In CentOS 5, 6, and 7, you may need to install or update EPEL 5, 6, or 7 repositories.Seethis pagefor instructions.
Note. In Amazon Linux, the EPEL repository is installed by default, but notenabled. Seethese instructionsfor how to enable it.
Once you have protobuf-c installed on your machine, you are ready to buildcstore_fdw. For this, you need to include the pg_config directory path inyour make command. This path is typically the same as your PostgreSQLinstallation's bin/ directory path. For example:
PATH=/usr/local/pgsql/bin/:$PATH makesudo PATH=/usr/local/pgsql/bin/:$PATH make install
Note. cstore_fdw requires PostgreSQL version from 9.3 to 12. It doesn'tsupport earlier versions of PostgreSQL.
Before using cstore_fdw, you need to add it toshared_preload_libraries
in yourpostgresql.conf
and restart Postgres:
shared_preload_libraries = 'cstore_fdw' # (change requires restart)
The following parameters can be set on a cstore foreign table object.
- filename (optional): The absolute path to the location for storing table data.If you don't specify the filename option, cstore_fdw will automaticallychoose the $PGDATA/cstore_fdw directory to store the files. If specified thevalue of this parameter will be used as a prefix for all files created tostore table data. For example, the value
/cstore_fdw/my_table
could result inthe files/cstore_fdw/my_table
and/cstore_fdw/my_table.footer
being usedto manage table data. - compression (optional): The compression used for compressing value streams.Valid options are
none
andpglz
. The default isnone
. - stripe_row_count (optional): Number of rows per stripe. The default is
150000
. Reducing this decreases the amount memory used for loading dataand querying, but also decreases the performance. - block_row_count (optional): Number of rows per column block. The default is
10000
. cstore_fdw compresses, creates skip indexes, and reads from diskat the block granularity. Increasing this value helps with compression and resultsin fewer reads from disk. However, higher values also reduce the probability ofskipping over unrelated row blocks.
To load or append data into a cstore table, you have two options:
- You can use the
COPY
command to load or append data froma file, a program, or STDIN. - You can use the
INSERT INTO cstore_table SELECT ...
syntax to load orappend data from another table.
You can use theANALYZE
command to collect statisticsabout the table. These statistics help the query planner to help determine themost efficient execution plan for each query.
Note. We currently don't support updating table using DELETE, and UPDATEcommands. We also don't support single row inserts.
To update an existing cstore_fdw installation from versions earlier than 1.6you can take the following steps:
- Download and install cstore_fdw version 1.6 using instructions from the "Building"section,
- Restart the PostgreSQL server,
- Run
ALTER EXTENSION cstore_fdw UPDATE;
As an example, we demonstrate loading and querying data to/from a column storetable from scratch here. Let's start with downloading and decompressing the datafiles.
wget http://examples.citusdata.com/customer_reviews_1998.csv.gzwget http://examples.citusdata.com/customer_reviews_1999.csv.gzgzip -d customer_reviews_1998.csv.gzgzip -d customer_reviews_1999.csv.gz
Then, let's log into Postgres, and run the following commands to create a columnstore foreign table:
-- load extension first time after installCREATE EXTENSION cstore_fdw;-- create server objectCREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;-- create foreign tableCREATE FOREIGN TABLE customer_reviews( customer_idTEXT, review_dateDATE, review_ratingINTEGER, review_votesINTEGER, review_helpful_votesINTEGER, product_idCHAR(10), product_titleTEXT, product_sales_rankBIGINT, product_groupTEXT, product_categoryTEXT, product_subcategoryTEXT, similar_product_idsCHAR(10)[])SERVER cstore_serverOPTIONS(compression'pglz');
Next, we load data into the table:
\COPY customer_reviewsFROM'customer_reviews_1998.csv' WITH CSV;\COPY customer_reviewsFROM'customer_reviews_1999.csv' WITH CSV;
Note. If you are gettingERROR: cannot copy to foreign table "customer_reviews"
when trying to run the COPY commands, double check that youhave added cstore_fdw toshared_preload_libraries
inpostgresql.conf
and restarted Postgres.
Next, we collect data distribution statistics about the table. This is optional,but usually very helpful:
ANALYZE customer_reviews;
Finally, let's run some example SQL queries on the column store table.
-- Find all reviews a particular customer made on the Dune series in 1998.SELECT customer_id, review_date, review_rating, product_id, product_titleFROM customer_reviewsWHERE customer_id='A27T7HVDXA3K2A'AND product_titleLIKE'%Dune%'AND review_date>='1998-01-01'AND review_date<='1998-12-31';-- Do we have a correlation between a book's title's length and its review ratings?SELECT width_bucket(length(product_title),1,50,5) title_length_bucket, round(avg(review_rating),2)AS review_average,count(*)FROM customer_reviewsWHERE product_group='Book'GROUP BY title_length_bucketORDER BY title_length_bucket;
The example above illustrated how to load data into a PostgreSQL database runningon a single host. However, sometimes your data is too large to analyze effectivelyon a single host. Citus is a product built by Citus Data that allows you to runa distributed PostgreSQL database to analyze your data using the power of multiplehosts. You can easily install and run other PostgreSQL extensions and foreign datawrappers—including cstore_fdw—alongside Citus.
You can create a cstore_fdw table and distribute it using thecreate_distributed_table()
UDF just like any other table. You can load datausing thecopy
command as you would do in single node PostgreSQL.
cstore_fdw partitions each column into multiple blocks. Skip indexes store minimumand maximum values for each of these blocks. While scanning the table, if min/maxvalues of the block contradict the WHERE clause, then the block is completelyskipped. This way, the query processes less data and hence finishes faster.
To use skip indexes more efficiently, you should load the data after sorting iton a column that is commonly used in the WHERE clause. This ensures that there isa minimum overlap between blocks and the chance of them being skipped is higher.
In practice, the data generally has an inherent dimension (for example a time field)on which it is naturally sorted. Usually, the queries also have a filter clause onthat column (for example you want to query only the last week's data), and hence youdon't need to sort the data in such cases.
Before uninstalling the extension, first you need to drop all the cstore tables:
postgres=# DROP FOREIGN TABLE cstore_table_1;...postgres=# DROP FOREIGN TABLE cstore_table_n;
Then, you should drop the cstore server and extension:
postgres=# DROP SERVER cstore_server;postgres=# DROP EXTENSION cstore_fdw;
cstore_fdw automatically creates some directories inside the PostgreSQL's datadirectory to store its files. To remove them, you can run:
$ rm -rf $PGDATA/cstore_fdw
Then, you should remove cstore_fdw fromshared_preload_libraries
inyourpostgresql.conf
:
shared_preload_libraries = '' # (change requires restart)
Finally, to uninstall the extension you can run the following command in theextension's source code directory. This will clean up all the files copied duringthe installation:
$ sudo PATH=/usr/local/pgsql/bin/:$PATH make uninstall
- (Fix) Add support for PostgreSQL 12
- (Fix) Support count(t.*) from t type queries
- (Fix) Build failures for MacOS 10.14+
- (Fix) Make foreign scan parallel safe
- (Fix) Add support for PostgreSQL 11 COPY
- (Fix) Add support for PostgreSQL 11
- (Fix) Fix crash during truncate (Cstore crashing server when enabled, not used)
- (Fix) No such file or directory warning when attempting to drop database
- (Feature) Added support for PostgreSQL 10.
- (Fix) Removed table files when a schema, extension or database is dropped.
- (Fix) Removed unused code fragments.
- (Fix) Fixed incorrect initialization of stripe buffers.
- (Fix) Checked user access rights when executing truncate.
- (Fix) Made copy command cancellable.
- (Fix) Fixed namespace issue regarding drop table.
- (Fix) Verify cstore_fdw server on CREATE FOREIGN TABLE command
- (Feature) Added support for PostgreSQL 9.6.
- (Fix) Removed table data when cstore_fdw table is indirectly dropped.
- (Fix) Removed unused code fragments.
- (Fix) Fixed column selection logic to return columns used in expressions.
- (Fix) Prevented alter table command from changinf column type to incompatible types.
- (Fix) Compatibility fix for Cituscopy command.
- (Feature) Added support for
TRUNCATE TABLE
- (Fix) Added support for PostgreSQL 9.5
- (Feature) Added support for
ALTER TABLE ADD COLUMN
andALTER TABLE DROP COLUMN
. - (Feature) Added column list support in
COPY FROM
. - (Optimization) Improve row count estimation, which results in better plans.
- (Fix) Fix the deadlock issue during concurrent inserts.
- (Fix) Return correct result when using whole row references.
- (Feature) Added support for
COPY TO
. - (Feature) Added support for
INSERT INTO cstore_table SELECT ...
. - (Optimization) Improved memory usage.
- (Fix) Dropping multiple cstore tables in a single command cleans-up filesof all them.
- (Feature) Make filename option optional, and use a default directory inside$PGDATA to manage cstore tables.
- (Feature) Automatically delete files on DROP FOREIGN TABLE.
- (Fix) Return empty table if no data has been loaded. Previously, cstore_fdwerrored out.
- (Fix) Fix overestimating relation column counts when planning.
- (Feature) Added cstore_table_size(tablename) for getting the size of a cstoretable in bytes.
Copyright (c) Citus Data, Inc.
This module is free software; you can redistribute it and/or modify it under theApache v2.0 License.
For all types of questions and comments about the wrapper, please contact us atengage @ citusdata.com.
About
Columnar storage extension for Postgres built as a foreign data wrapper. Check outhttps://github.com/citusdata/citus for a modernized columnar storage implementation built as a table access method.