Shayon Mukherjee's blog

Exploring PostgreSQL to Parquet archival for JSON data with S3 range reads

October 3, 2025
~34 mins


PostgreSQL handles large JSON payloads reasonably well until you start updating or deleting them frequently. Once payloads grow large enough to be TOASTed and churn becomes high, autovacuum can dominate your I/O budget and cause other issues. I have been exploring the idea of moving older JSON data (read: cold data) to Parquet on S3 while keeping recent data hot in PostgreSQL daily partitions, then simply dropping those partitions instead of running expensive DELETE operations and subsequent vacuum cycles. What I am about to discuss should be fairly well-known patterns, so let's dive in.

The TOAST problem space

PostgreSQL’s TOAST mechanism kicks in when a row would otherwise exceed roughly 2 KB (a quarter of the 8 KB page size): the server compresses oversized values and, if needed, slices them into ~2 KB chunks stored in a separate TOAST relation, keeping the main heap tuples small. This design works well for read-heavy workloads, but creates challenges when large JSON payloads experience frequent updates or deletes.

Consider a typical json_payloads table with columns like id, key, output (the large JSON blob), customer_id, group_id, and created_at. When the JSON in the output column grows beyond the TOAST threshold, PostgreSQL automatically moves it to TOAST storage, creating the update and delete performance issues we’re exploring.

When you update or delete a large JSON payload, PostgreSQL creates a new row version in the main table and typically writes fresh chunks to the toast table. The old row and toast chunks become dead tuples that autovacuum must scan and reclaim. Under high churn, this creates cascading work across both relations, often resulting in vacuum operations that take hours to complete on tables with hundreds of millions of pages. I will have to write another post just on how bad some of these autovacuums can get.

The core issue is that PostgreSQL’s MVCC design assumes most data modifications affect relatively small row versions. Large JSON documents break this assumption, especially when they’re frequently updated or deleted.

Hot and cold data separation

Let’s explore a two-tier storage approach that treats JSON payload data based on access patterns rather than forcing everything through PostgreSQL’s heap and toast machinery. Recent payloads stay hot in PostgreSQL daily partitions for fast transactional access, while older data moves to cold storage in Parquet files on S3. This kind of setup is useful when your query patterns typically seek recent data more often than older data.

The main thing here is using DROP PARTITION instead of DELETE operations. Dropping a partition removes both heap and TOAST relations instantly, completely avoiding the expensive vacuum cycles that plagued our original approach.
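As a rough sketch, the retention job then reduces to generating two DDL statements instead of a slow DELETE. The partition naming convention (json_payloads_pYYYYMMDD) is my assumption here, not necessarily the real one:

```ruby
require "date"

# Hypothetical naming convention for daily partitions.
def partition_name_for(date)
  "json_payloads_p#{date.strftime('%Y%m%d')}"
end

# Retiring a partition: detach it from the parent, then drop it.
# The drop removes the heap and its TOAST relation instantly,
# leaving no dead tuples for autovacuum to reclaim.
def retire_partition_sql(date)
  name = partition_name_for(date)
  [
    "ALTER TABLE json_payloads DETACH PARTITION #{name};",
    "DROP TABLE #{name};"
  ]
end

retire_partition_sql(Date.new(2025, 9, 26))
# => ["ALTER TABLE json_payloads DETACH PARTITION json_payloads_p20250926;",
#     "DROP TABLE json_payloads_p20250926;"]
```

Detaching first lets the sync job verify the Parquet copy before the irreversible drop.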

[Diagram] PostgreSQL daily partitions (hot: recent ~7 days, e.g. 2025-10-07 … 2025-10-03)
    partitions older than ~7 days --sync--> S3 Parquet files (cold: row-group pruning, predicate pushdown)
    synced partitions are then dropped via DROP PARTITION

This separation lets PostgreSQL focus on what it does best: ACID transactions, indexed lookups, and managing active data. The cold tier handles what S3 and Parquet excel at: cost-effective storage, compression, and analytical-style queries with column projection.

Parquet file structure and schema mapping

This is a good read to familiarize yourself with Parquet: https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/

When we export data from our json_payloads table to Parquet, we define a schema that maps our PostgreSQL columns to Parquet’s columnar format:

PostgreSQL json_payloads          Parquet schema
  id          (bigint)      ->      id          INT64   (col 0)
  key         (text)        ->      key         STRING  (col 1)
  output      (jsonb)       ->      output      STRING  (col 2)
  customer_id (bigint)      ->      customer_id INT64   (col 3)
  group_id    (bigint)      ->      group_id    INT64   (col 4)
  created_at  (timestamptz) ->      created_at  INT64   (col 5)

Parquet’s internal organization enables the selective reading that makes this architecture viable. Each file consists of row groups containing columnar data, with a footer that stores schema information and statistics for efficient pruning.

[Parquet file layout: row groups + footer]
Row Group 0: offset=1024,    size=1048576, min_payload_id=5001, max_payload_id=5200
Row Group 1: offset=1049600, size=1048576, min_payload_id=5201, max_payload_id=5400
Row Group 2: offset=2098176, size=1048576, min_payload_id=5401, max_payload_id=5600
Footer: schema + per-row-group statistics (byte offsets, row counts, min/max payload_id)
Lookup for payload_id = 5225 -> footer stats -> Row Group 1 (5201-5400)

The footer statistics enable predicate pushdown. When searching for payload_id = 5225, we can examine the min/max values and immediately skip Row Groups 0 and 2, knowing they don’t contain our target. This eliminates the need to download and scan irrelevant data.

Deterministic sharding and file organization

We distribute payloads across shards using a hash function to ensure predictable file locations while controlling file proliferation. Each payload_id maps to exactly one shard within each time period.

payload_id = 5225
shard = hash64(5225) % 32   # Results in shard 7
S3 Key: customer_123/group_456/2025/10/03/hour_14/shard_07_segment_00.parquet

This creates a hierarchical S3 structure that supports efficient partition pruning:

s3://<bucket>/customer_123/group_456/2025/10/03/hour_14/
    shard_00_segment_00.parquet
    shard_01_segment_00.parquet
    ...
    shard_07_segment_00.parquet   <- target file
    ...
    shard_31_segment_00.parquet

Writers append rows to their assigned shard until reaching approximately 128-256 MB, then roll to the next segment. If data volume is high, we might see multiple segments per shard within the same hour—for example, shard_07_segment_00.parquet, shard_07_segment_01.parquet, and shard_07_segment_02.parquet. Within each file, we target row groups of 1-4 MB compressed to optimize for HTTP range requests. This row group sizing is critical because it determines the granularity of our range requests—small enough to minimize data transfer, large enough to amortize request overhead. This approach caps file proliferation while ensuring any payload_id maps to a single predictable location without requiring an index lookup.
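A minimal sketch of the shard and key derivation. The post doesn't pin down the exact hash function, so hash64 here is a stand-in built from the first 8 bytes of an MD5 digest; any stable 64-bit hash works as long as every writer and reader agrees on it:

```ruby
require "digest"

SHARD_COUNT = 32

# Stand-in 64-bit hash: first 8 bytes of MD5, little-endian unsigned.
def hash64(payload_id)
  Digest::MD5.digest(payload_id.to_s).unpack1("Q<")
end

# Build the deterministic S3 key for a payload at a given time.
def s3_key(customer_id:, group_id:, time:, payload_id:, segment: 0)
  shard = hash64(payload_id) % SHARD_COUNT
  format("customer_%d/group_%d/%s/hour_%02d/shard_%02d_segment_%02d.parquet",
         customer_id, group_id, time.strftime("%Y/%m/%d"), time.hour,
         shard, segment)
end

s3_key(customer_id: 123, group_id: 456,
       time: Time.utc(2025, 10, 3, 14), payload_id: 5225)
# e.g. "customer_123/group_456/2025/10/03/hour_14/shard_NN_segment_00.parquet"
```

Because the key is pure arithmetic over attributes the reader already has, a point lookup can compute its target shard without consulting any index.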

Database schema for metadata tracking

To query this packed data, we can maintain a lightweight catalog in PostgreSQL that tracks S3 files and enables efficient point lookups without pulling an entire Parquet file on every request. Something like:

CREATE TABLE s3_file_catalog (
  id                  SERIAL PRIMARY KEY,
  customer_id         BIGINT NOT NULL,
  group_id            BIGINT NOT NULL,
  date                DATE NOT NULL,
  hour                INTEGER NOT NULL,
  shard               INTEGER NOT NULL,
  segment             INTEGER NOT NULL,
  s3_key              TEXT NOT NULL,
  min_payload_id      BIGINT NOT NULL,
  max_payload_id      BIGINT NOT NULL,
  total_size_bytes    BIGINT NOT NULL,
  externalized_ranges INTEGER[] DEFAULT '{}',
  created_at          TIMESTAMP DEFAULT NOW()
);

CREATE TABLE s3_row_groups (
  id                  SERIAL PRIMARY KEY,
  s3_file_catalog_id  BIGINT REFERENCES s3_file_catalog(id),
  ordinal             INTEGER NOT NULL,
  offset_bytes        BIGINT NOT NULL,
  size_bytes          BIGINT NOT NULL,
  row_count           INTEGER NOT NULL,
  min_payload_id      BIGINT NOT NULL,
  max_payload_id      BIGINT NOT NULL,
  created_at          TIMESTAMP DEFAULT NOW()
);

The catalog remains small because it’s O(files) rather than O(payloads). Each file entry contains the metadata needed to determine whether it could contain a target payload_id, while row group entries provide the byte offsets needed for precise HTTP range requests.

The s3_row_groups table serves as a performance optimization that caches parsed Parquet footer metadata. While Parquet footers contain all the information we need—including row group byte offsets and column statistics—parsing this metadata on every lookup would require reading the footer and decoding the statistics.

When we build Parquet files from our json_payloads data, Parquet automatically generates column statistics for each row group, including min/max values for our id column (which contains the business payload_id values). These statistics are stored as encoded bytes indexed by column position—our id column statistics appear at index 0 in the footer’s statistics map.

The s3_row_groups table caches these decoded min/max payload_id values along with the row group byte offsets. This optimization is well-established in data lake architectures—systems like Iceberg and Delta Lake maintain similar metadata catalogs to avoid repeated footer parsing. For our use case, this allows us to skip both the footer read and the statistics decoding on every lookup.

Syncing partitions to Parquet

When a daily partition becomes eligible for archival, we can stream its rows through Parquet writers that handle sharding, size-based externalization, and metadata capture. Parquet writer libraries are available in many languages, like Go, Rust, Python, and Ruby.

def sync_partition_to_s3(partition_name)
  writers = {}

  # "SELECT * FROM #{partition_name} limit...."
  fetch_data_in_batches.each do |row|
    shard = hash64(row[:payload_id]) % 32
    key = [row[:customer_id], row[:group_id], row[:hour], shard]
    writer = writers[key] ||= create_parquet_writer(key)

    if estimate_compressed_size(row[:output]) > 4.megabytes
      externalize_to_s3(row)
      row[:output] = nil
      row[:is_external] = true
    end

    writer.append_row(row)

    if writer.size_bytes > 256.megabytes
      close_and_catalog_writer(writer)
      writers[key] = create_parquet_writer(key)
    end
  end

  writers.each { |key, writer| close_and_catalog_writer(writer) }

  execute("ALTER TABLE json_payloads DETACH PARTITION #{partition_name}")
  # When ready, drop the partition
end

The close_and_catalog_writer function reads the completed Parquet file’s footer metadata and extracts the information needed for efficient lookups. For each row group, it captures the byte offset, compressed size, and row count from the footer directory. More importantly, it decodes the column statistics for our id column (stored at index 0 in the statistics map) to extract the actual min and max payload_id values from the encoded byte arrays.

This metadata extraction step is crucial because Parquet stores statistics as min_bytes and max_bytes that need decoding. For INT64 columns like our payload_id, we decode these as little-endian 8-byte integers. By doing this work once during file creation and caching the results in PostgreSQL, we enable fast indexed lookups without the overhead of footer parsing and byte decoding on every read.
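In Ruby, that decoding step is a one-liner with String#unpack1, since the statistics are 8-byte little-endian values (the literal bytes below are just an illustration of what a footer would carry):

```ruby
# Simulate what a Parquet footer stores for an INT64 column's statistics:
# raw 8-byte little-endian encodings of the min and max values.
min_bytes = [5201].pack("q<")   # => "\x51\x14\x00\x00\x00\x00\x00\x00"
max_bytes = [5400].pack("q<")

# Decoding during close_and_catalog_writer: little-endian signed 64-bit.
min_payload_id = min_bytes.unpack1("q<")  # => 5201
max_payload_id = max_bytes.unpack1("q<")  # => 5400
```

These decoded values are what get written into s3_row_groups, so reads never touch the raw bytes again.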

Note: There are other ways to stream this data too, like logical replication. That kind of setup works as long as you only need to stream INSERTs, your data doesn’t receive UPDATEs, and you can rely on something like S3 lifecycle rules for deletes. There are also other methods, like incrementally syncing data. I am leaving the details of efficiently streaming data from PG to S3 in Parquet format for another post, but know that there are good levers where you can achieve near-real-time data too.

Point lookup mechanics

When the application requests a specific payload, we follow a sequence designed to minimize both database queries and S3 requests through predicate pushdown and range targeting.

Application -> lookup payload_id = 5225

1. Compute shard:   shard = hash64(5225) % 32                        -> shard 7
2. Query catalog:   SELECT id, s3_key FROM s3_file_catalog
                    WHERE shard = 7
                      AND min_payload_id <= 5225
                      AND max_payload_id >= 5225
                    LIMIT 1
3. Find row group:  SELECT ordinal, offset_bytes, size_bytes
                    FROM s3_row_groups
                    WHERE s3_file_catalog_id = ...
                      AND min_payload_id <= 5225
                      AND max_payload_id >= 5225
                    -> Row Group 1 (offset=1049600, size=1048576)
4. S3 range read:   GET s3_key, Range: bytes=1049600-2098175
5. Decode & filter: scan the row group for payload_id = 5225,
                    return the JSON payload {"status": ...}

This approach typically completes point lookups in 100-200ms, with most time spent on the S3 range request rather than database operations or Parquet decoding. The main optimization is predicate pushdown combined with precise range targeting.

Predicate pushdown and I/O efficiency

The statistics stored in Parquet footers enable sophisticated predicate pushdown that dramatically reduces I/O compared to naive approaches. Rather than downloading entire files and filtering in memory, we can eliminate entire row groups based on cached metadata from our s3_row_groups table.

Consider a query for payloads where payload_id BETWEEN 5300 AND 5350. Without predicate pushdown, we’d download the entire 256 MB file. With row group statistics, we can identify that only Row Group 1 could contain matching records and fetch just that 1-4 MB segment:

Row Group Analysis (from cached metadata):
Row Group 0: min_payload_id=5001, max_payload_id=5200 -> SKIP  (max=5200 < 5300)
Row Group 1: min_payload_id=5201, max_payload_id=5400 -> FETCH (range 5300-5350 overlaps)
Row Group 2: min_payload_id=5401, max_payload_id=5600 -> SKIP  (min=5401 > 5350)

Result: Single range request for Row Group 1 only
S3 GET Range: bytes=1049600-2098175 (1 MB instead of 256 MB)
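The pruning above is just an interval-overlap check over the cached catalog rows. A small sketch (offsets are illustrative, matching the 1 MB row groups in the example):

```ruby
RowGroup = Struct.new(:ordinal, :min_payload_id, :max_payload_id,
                      :offset_bytes, :size_bytes)

# What a SELECT from s3_row_groups would return for one file.
ROW_GROUPS = [
  RowGroup.new(0, 5001, 5200, 1_024,     1_048_576),
  RowGroup.new(1, 5201, 5400, 1_049_600, 1_048_576),
  RowGroup.new(2, 5401, 5600, 2_098_176, 1_048_576)
].freeze

# Keep only row groups whose [min, max] interval overlaps [lo, hi].
def groups_overlapping(row_groups, lo, hi)
  row_groups.select { |rg| rg.min_payload_id <= hi && rg.max_payload_id >= lo }
end

# HTTP Range header for one row group (end offset is inclusive).
def range_header(rg)
  "bytes=#{rg.offset_bytes}-#{rg.offset_bytes + rg.size_bytes - 1}"
end

hits = groups_overlapping(ROW_GROUPS, 5300, 5350)
hits.map(&:ordinal)       # => [1]
range_header(hits.first)  # => "bytes=1049600-2098175"
```

The same helper serves point lookups: groups_overlapping(ROW_GROUPS, 5225, 5225) also narrows to Row Group 1.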

This predicate pushdown reduces network transfer by over 99% compared to downloading the full file. My POC demonstrates this efficiency—fetching a single payload downloads just 2.5 MB and reports “4.2% of file” in the logs, showing how row group targeting minimizes I/O while maintaining the flexibility to handle arbitrary payload_id ranges.

Performance characteristics and trade-offs

The two-tier architecture delivers predictable performance across different access patterns. Hot data in PostgreSQL maintains sub-10ms response times for indexed lookups, while cold data in Parquet typically responds in 100-200ms depending on row group size and S3 latency.

Storage costs favor the cold tier significantly. PostgreSQL storage includes not just the JSON data but also indexes, TOAST overhead, and free space left behind by vacuum operations. Parquet, meanwhile, achieves 60-80% compression ratios on JSON data and costs roughly $0.023 per GB per month on S3.
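A back-of-envelope check using those numbers (70% compression assumed as a midpoint of the 60-80% range):

```ruby
# Rough monthly S3 cost for the cold tier, using the post's figures:
# 60-80% compression on JSON and ~$0.023 per GB-month for S3 standard.
def s3_monthly_cost(raw_gb, compression_ratio: 0.7, price_per_gb: 0.023)
  stored_gb = raw_gb * (1 - compression_ratio)
  (stored_gb * price_per_gb).round(2)
end

s3_monthly_cost(1000)   # 1 TB of raw JSON -> ~300 GB stored
# => 6.9
```

About $7/month for a terabyte of raw JSON, before request costs—orders of magnitude below keeping the same data (plus indexes and TOAST overhead) hot in PostgreSQL.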

The trade-off comes in operational complexity. We now manage two storage systems with different consistency models, backup strategies, and failure modes. However, this complexity is contained within the storage layer and remains invisible to application code through a unified payload access interface.

The most significant operational benefit is eliminating autovacuum pressure on large JSON workloads. By using DROP PARTITION instead of DELETE operations, we avoid the expensive vacuum cycles that previously dominated our I/O budget during data retention cleanup.

Considerations

This approach works well for append-heavy workloads where older data becomes read-only, but it introduces some trade-offs worth understanding.

Data consistency: S3 Parquet data is eventually consistent with PostgreSQL. During sync failures or partial writes, temporary inconsistencies can occur. The catalog helps detect these, but applications need to handle cases where metadata exists but S3 objects don’t.
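A hedged sketch of that detection step: in practice you'd compare the catalog against S3 via aws-sdk-s3 (list_objects_v2 or head_object); here both sides are plain arrays of keys so the check stays self-contained:

```ruby
# Catalog entries whose S3 object is missing are "dangling" and need
# repair (re-sync the partition or quarantine the catalog row).
def dangling_catalog_entries(catalog_keys, s3_keys)
  catalog_keys - s3_keys
end

catalog = ["a/shard_07_segment_00.parquet", "a/shard_07_segment_01.parquet"]
in_s3   = ["a/shard_07_segment_00.parquet"]

dangling_catalog_entries(catalog, in_s3)
# => ["a/shard_07_segment_01.parquet"]
```

The reverse difference (objects in S3 with no catalog row) is worth checking too—those files are unreachable by lookups and safe to re-catalog or expire.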

Updates and deletes on archived data: Once JSON payloads move to Parquet, updates become expensive since they require rewriting entire row groups. Deletes need tombstone tracking or file compaction. This design assumes older data rarely changes and you could just rely on S3 lifecycle policies for expiration. If you need frequent updates on archived data, keeping it in PostgreSQL may make more sense.

Future optimizations and considerations

Several enhancements could further improve this architecture: async I/O to fetch multiple ranges from one or more Parquet files concurrently, or Bloom filters in Parquet footers to reduce false positives when row group min/max ranges overlap with query predicates.

The key architectural principle remains unchanged: leverage PostgreSQL for transactional consistency and fast indexed access on hot data, while using Parquet and S3 for cost-effective storage and analytical-style queries on cold data. This separation of concerns allows each system to operate within its optimal performance envelope while providing a unified interface to applications.

Lastly, there are also tools like DuckDB that let you query this same kind of data directly with SQL, which is quite nice and something I plan on exploring a bit more.

last modified October 3, 2025
