Reading and Writing the Apache Parquet Format#

The Apache Parquet project provides a standardized open-source columnar storage format for use in data analysis systems. It was created originally for use in Apache Hadoop, with systems like Apache Drill, Apache Hive, Apache Impala, and Apache Spark adopting it as a shared standard for high-performance data IO.

Apache Arrow is an ideal in-memory transport layer for data that is being read or written with Parquet files. We have been concurrently developing the C++ implementation of Apache Parquet, which includes a native, multithreaded C++ adapter to and from in-memory Arrow data. PyArrow includes Python bindings to this code, which thus enables reading and writing Parquet files with pandas as well.

Obtaining pyarrow with Parquet Support#

If you installed pyarrow with pip or conda, it should be built with Parquet support bundled:

In [1]: import pyarrow.parquet as pq

If you are building pyarrow from source, you must use -DARROW_PARQUET=ON when compiling the C++ libraries and enable the Parquet extensions when building pyarrow. If you want to use Parquet Encryption, then you must also use -DPARQUET_REQUIRE_ENCRYPTION=ON when compiling the C++ libraries. See the Python Development page for more details.

Reading and Writing Single Files#

The functions read_table() and write_table() read and write the pyarrow.Table object, respectively.

Let’s look at a simple table:

In [2]: import numpy as np

In [3]: import pandas as pd

In [4]: import pyarrow as pa

In [5]: df = pd.DataFrame({'one': [-1, np.nan, 2.5],
   ...:                    'two': ['foo', 'bar', 'baz'],
   ...:                    'three': [True, False, True]},
   ...:                   index=list('abc'))
   ...:

In [6]: table = pa.Table.from_pandas(df)

We write this to Parquet format with write_table:

In [7]: import pyarrow.parquet as pq

In [8]: pq.write_table(table, 'example.parquet')

This creates a single Parquet file. In practice, a Parquet dataset may consist of many files in many directories. We can read a single file back with read_table:

In [9]: table2 = pq.read_table('example.parquet')

In [10]: table2.to_pandas()
Out[10]: 
   one  two  three
a -1.0  foo   True
b  NaN  bar  False
c  2.5  baz   True

You can pass a subset of columns to read, which can be much faster than reading the whole file (due to the columnar layout):

In [11]: pq.read_table('example.parquet', columns=['one', 'three'])
Out[11]: 
pyarrow.Table
one: double
three: bool
----
one: [[-1,null,2.5]]
three: [[true,false,true]]

When reading a subset of columns from a file that used a pandas DataFrame as the source, we use read_pandas to maintain any additional index column data:

In [12]: pq.read_pandas('example.parquet', columns=['two']).to_pandas()
Out[12]: 
   two
a  foo
b  bar
c  baz

We do not need to use a string to specify the origin of the file. It can be any of:

  • A file path as a string

  • A NativeFile from PyArrow

  • A Python file object

In general, a Python file object will have the worst read performance, while a string file path or an instance of NativeFile (especially memory maps) will perform the best.
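As an illustrative sketch (reusing the example.parquet file written above), the same file can be read through a memory-mapped NativeFile or through a plain Python file object:

# Read through a memory-mapped NativeFile (per the note above, typically
# among the fastest options)
with pa.memory_map('example.parquet') as source:
    table_mm = pq.read_table(source)

# Read through a Python file object (typically the slowest option)
with open('example.parquet', 'rb') as f:
    table_obj = pq.read_table(f)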

Reading Parquet and Memory Mapping#

Because Parquet data needs to be decoded from the Parquet format and compression, it can't be directly mapped from disk. Thus the memory_map option might perform better on some systems but won't help much with resident memory consumption.

>>> pq_array = pa.parquet.read_table("area1.parquet", memory_map=True)
>>> print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
RSS: 4299MB
>>> pq_array = pa.parquet.read_table("area1.parquet", memory_map=False)
>>> print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
RSS: 4299MB

If you need to deal with Parquet data bigger than memory, the Tabular Datasets and partitioning are probably what you are looking for.
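As a brief sketch of that approach (assuming a partitioned dataset stored under dataset_name/ and a column named one, as in the examples further below), the pyarrow.dataset API can scan Parquet files in record batches instead of loading everything at once:

import pyarrow.dataset as ds

dataset = ds.dataset('dataset_name/', format='parquet')
# Stream the data one record batch at a time instead of materializing it all
for batch in dataset.to_batches(columns=['one'], filter=ds.field('one') > 0):
    ...  # process each batch here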

Parquet file writing options#

write_table() has a number of options to control various settings when writing a Parquet file.

  • version, the Parquet format version to use. '1.0' ensures compatibility with older readers, while '2.4' and greater values enable more Parquet types and encodings.

  • data_page_size, to control the approximate size of encoded data pages within a column chunk. This currently defaults to 1MB.

  • flavor, to set compatibility options particular to a Parquet consumer like 'spark' for Apache Spark.

See the write_table() docstring for more details.
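For instance, a sketch combining the options above (the output file name is illustrative):

pq.write_table(table, 'example_options.parquet',
               version='2.6',               # newer Parquet format version
               data_page_size=1024 * 1024,  # ~1 MB encoded data pages
               flavor='spark')              # Spark compatibility settings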

There are some additional data type handling-specific options described below.

Omitting the DataFrame index#

When using pa.Table.from_pandas to convert to an Arrow table, by default one or more special columns are added to keep track of the index (row labels). Storing the index takes extra space, so if your index is not valuable, you may choose to omit it by passing preserve_index=False:

In [13]: df = pd.DataFrame({'one': [-1, np.nan, 2.5],
   ....:                    'two': ['foo', 'bar', 'baz'],
   ....:                    'three': [True, False, True]},
   ....:                   index=list('abc'))
   ....:

In [14]: df
Out[14]: 
   one  two  three
a -1.0  foo   True
b  NaN  bar  False
c  2.5  baz   True

In [15]: table = pa.Table.from_pandas(df, preserve_index=False)

Then we have:

In [16]: pq.write_table(table, 'example_noindex.parquet')

In [17]: t = pq.read_table('example_noindex.parquet')

In [18]: t.to_pandas()
Out[18]: 
   one  two  three
0 -1.0  foo   True
1  NaN  bar  False
2  2.5  baz   True

Here you see the index did not survive the round trip.

Finer-grained Reading and Writing#

read_table uses the ParquetFile class, which has other features:

In [19]: parquet_file = pq.ParquetFile('example.parquet')

In [20]: parquet_file.metadata
Out[20]: 
<pyarrow._parquet.FileMetaData object at 0x7fdfff1ce3e0>
  created_by: parquet-cpp-arrow version 22.0.0
  num_columns: 4
  num_rows: 3
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 2652

In [21]: parquet_file.schema
Out[21]: 
<pyarrow._parquet.ParquetSchema object at 0x7fe0be78b680>
required group field_id=-1 schema {
  optional double field_id=-1 one;
  optional binary field_id=-1 two (String);
  optional boolean field_id=-1 three;
  optional binary field_id=-1 __index_level_0__ (String);
}

As described in more detail in the Apache Parquet format, a Parquet file consists of multiple row groups. read_table will read all of the row groups and concatenate them into a single table. You can read individual row groups with read_row_group:

In [22]: parquet_file.num_row_groups
Out[22]: 1

In [23]: parquet_file.read_row_group(0)
Out[23]: 
pyarrow.Table
one: double
two: string
three: bool
__index_level_0__: string
----
one: [[-1,null,2.5]]
two: [["foo","bar","baz"]]
three: [[true,false,true]]
__index_level_0__: [["a","b","c"]]

We can similarly write a Parquet file with multiple row groups by using ParquetWriter:

In [24]: with pq.ParquetWriter('example2.parquet', table.schema) as writer:
   ....:     for i in range(3):
   ....:         writer.write_table(table)
   ....:

In [25]: pf2 = pq.ParquetFile('example2.parquet')

In [26]: pf2.num_row_groups
Out[26]: 3

Inspecting the Parquet File Metadata#

The FileMetaData of a Parquet file can be accessed through ParquetFile as shown above:

In [27]: parquet_file = pq.ParquetFile('example.parquet')

In [28]: metadata = parquet_file.metadata

or can also be read directly using read_metadata():

In [29]: metadata = pq.read_metadata('example.parquet')

In [30]: metadata
Out[30]: 
<pyarrow._parquet.FileMetaData object at 0x7fdffc943dd0>
  created_by: parquet-cpp-arrow version 22.0.0
  num_columns: 4
  num_rows: 3
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 2652

The returned FileMetaData object allows you to inspect the Parquet file metadata, such as the row group and column chunk metadata and statistics:

In [31]: metadata.row_group(0)
Out[31]: 
<pyarrow._parquet.RowGroupMetaData object at 0x7fe0be7c50d0>
  num_columns: 4
  num_rows: 3
  total_byte_size: 290
  sorting_columns: ()

In [32]: metadata.row_group(0).column(0)
Out[32]: 
<pyarrow._parquet.ColumnChunkMetaData object at 0x7fe0be7c4a90>
  file_offset: 0
  file_path: 
  physical_type: DOUBLE
  num_values: 3
  path_in_schema: one
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7fe0be7c5cb0>
      has_min_max: True
      min: -1.0
      max: 2.5
      null_count: 1
      distinct_count: None
      num_values: 2
      physical_type: DOUBLE
      logical_type: None
      converted_type (legacy): NONE
  geo_statistics:
    None
  compression: SNAPPY
  encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 36
  total_compressed_size: 106
  total_uncompressed_size: 102

Data Type Handling#

Reading types as DictionaryArray#

The read_dictionary option in read_table and ParquetDataset will cause columns to be read as DictionaryArray, which will become pandas.Categorical when converted to pandas. This option is only valid for string and binary column types, and it can yield significantly lower memory use and improved performance for columns with many repeated string values.

pq.read_table(where, read_dictionary=['binary_c0', 'stringb_c2'])

Storing timestamps#

Some Parquet readers may only support timestamps stored in millisecond ('ms') or microsecond ('us') resolution. Since pandas uses nanoseconds to represent timestamps, this can occasionally be a nuisance. By default (when writing version 1.0 Parquet files), the nanoseconds will be cast to microseconds ('us').

In addition, we provide the coerce_timestamps option to allow you to select the desired resolution:

pq.write_table(table, where, coerce_timestamps='ms')

If casting to a lower resolution would result in a loss of data, by default an exception will be raised. This can be suppressed by passing allow_truncated_timestamps=True:

pq.write_table(table, where, coerce_timestamps='ms',
               allow_truncated_timestamps=True)

Timestamps with nanoseconds can be stored without casting when using the more recent Parquet format version 2.6:

pq.write_table(table, where, version='2.6')

However, many Parquet readers do not yet support this newer format version, and therefore the default is to write version 1.0 files. When compatibility across different processing frameworks is required, it is recommended to use the default version 1.0.

Older Parquet implementations use INT96-based storage of timestamps, but this is now deprecated. This includes some older versions of Apache Impala and Apache Spark. To write timestamps in this format, set the use_deprecated_int96_timestamps option to True in write_table.

pq.write_table(table, where, use_deprecated_int96_timestamps=True)

Compression, Encoding, and File Compatibility#

The most commonly used Parquet implementations use dictionary encoding when writing files; if the dictionaries grow too large, then they "fall back" to plain encoding. Whether dictionary encoding is used can be toggled using the use_dictionary option:

pq.write_table(table, where, use_dictionary=False)

The data pages within a column in a row group can be compressed after the encoding passes (dictionary, RLE encoding). In PyArrow we use Snappy compression by default, but Brotli, Gzip, ZSTD, LZ4, and uncompressed are also supported:

pq.write_table(table, where, compression='snappy')
pq.write_table(table, where, compression='gzip')
pq.write_table(table, where, compression='brotli')
pq.write_table(table, where, compression='zstd')
pq.write_table(table, where, compression='lz4')
pq.write_table(table, where, compression='none')

Snappy generally results in better performance, while Gzip may yield smaller files.

These settings can also be set on a per-column basis:

pq.write_table(table, where,
               compression={'foo': 'snappy', 'bar': 'gzip'},
               use_dictionary=['foo', 'bar'])

Partitioned Datasets (Multiple Files)#

Multiple Parquet files constitute a Parquet dataset. These can be presented in a number of ways:

  • A list of Parquet absolute file paths

  • A directory name containing nested directories defining a partitioned dataset

A dataset partitioned by year and month may look like this on disk:

dataset_name/
  year=2007/
    month=01/
       0.parq
       1.parq
       ...
    month=02/
       0.parq
       1.parq
       ...
    month=03/
    ...
  year=2008/
    month=01/
    ...
  ...

Writing to Partitioned Datasets#

You can write a partitioned dataset for any pyarrow file system that is a file-store (e.g. local, HDFS, S3). The default behaviour when no filesystem is specified is to use the local filesystem.

# Local dataset write
pq.write_to_dataset(table, root_path='dataset_name',
                    partition_cols=['one', 'two'])

The root path in this case specifies the parent directory to which data will be saved. The partition columns are the column names by which to partition the dataset. Columns are partitioned in the order they are given. The partition splits are determined by the unique values in the partition columns.

To use another filesystem you only need to add the filesystem parameter. The individual table writes are wrapped using with statements internally, so the pq.write_to_dataset call does not need to be.

# Remote file-system example
from pyarrow.fs import HadoopFileSystem

fs = HadoopFileSystem(host, port, user=user, kerb_ticket=ticket_cache_path)
pq.write_to_dataset(table, root_path='dataset_name',
                    partition_cols=['one', 'two'],
                    filesystem=fs)

Compatibility Note: if using pq.write_to_dataset to create a table that will then be used by HIVE, then partition column values must be compatible with the allowed character set of the HIVE version you are running.

Writing _metadata and _common_metadata files#

Some processing frameworks such as Spark or Dask (optionally) use _metadata and _common_metadata files with partitioned datasets.

Those files include information about the schema of the full dataset (for _common_metadata) and potentially all row group metadata of all files in the partitioned dataset as well (for _metadata). The actual files are metadata-only Parquet files. Note this is not a Parquet standard, but a convention set in practice by those frameworks.

Using those files can make creating a Parquet Dataset more efficient, since the stored schema and file paths of all row groups can be used directly, instead of inferring the schema and crawling the directories for all Parquet files (this is especially the case for filesystems where accessing files is expensive).

The write_to_dataset() function does not automatically write such metadata files, but you can use it to gather the metadata and combine and write them manually:

# Write a dataset and collect metadata information of all written files
metadata_collector = []
pq.write_to_dataset(table, root_path, metadata_collector=metadata_collector)

# Write the ``_common_metadata`` parquet file without row groups statistics
pq.write_metadata(table.schema, root_path / '_common_metadata')

# Write the ``_metadata`` parquet file with row groups statistics of all files
pq.write_metadata(table.schema, root_path / '_metadata',
                  metadata_collector=metadata_collector)

When not using the write_to_dataset() function, but writing the individual files of the partitioned dataset using write_table() or ParquetWriter, the metadata_collector keyword can also be used to collect the FileMetaData of the written files. In this case, you need to make sure to set the file path contained in the row group metadata yourself before combining the metadata, and the schemas of all the different files and collected FileMetaData objects should be the same:

metadata_collector = []
pq.write_table(table1, root_path / "year=2017/data1.parquet",
               metadata_collector=metadata_collector)

# set the file path relative to the root of the partitioned dataset
metadata_collector[-1].set_file_path("year=2017/data1.parquet")

# combine and write the metadata
metadata = metadata_collector[0]
for _meta in metadata_collector[1:]:
    metadata.append_row_groups(_meta)
metadata.write_metadata_file(root_path / "_metadata")

# or use pq.write_metadata to combine and write in a single step
pq.write_metadata(table1.schema, root_path / "_metadata",
                  metadata_collector=metadata_collector)

Reading from Partitioned Datasets#

The ParquetDataset class accepts either a directory name or a list of file paths, and can discover and infer some common partition structures, such as those produced by Hive:

dataset = pq.ParquetDataset('dataset_name/')
table = dataset.read()

You can also use the convenience function read_table exposed by pyarrow.parquet, which avoids the need for an additional Dataset object creation step.

table = pq.read_table('dataset_name')

Note: the partition columns in the original table will have their types converted to Arrow dictionary types (pandas categorical) on load. Ordering of partition columns is not preserved through the save/load process. If reading from a remote filesystem into a pandas dataframe you may need to run sort_index to maintain row ordering (as long as the preserve_index option was enabled on write).
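As an illustrative sketch of that last point (assuming the dataset was written with preserve_index enabled):

# Read the partitioned dataset back into pandas and restore row ordering
df = pq.read_table('dataset_name').to_pandas().sort_index()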

Other features:

  • Filtering on all columns (using row group statistics) instead of only on the partition keys (see the example after the notes below).

  • Fine-grained partitioning: support for a directory partitioning scheme in addition to the Hive-like partitioning (e.g. "/2019/11/15/" instead of "/year=2019/month=11/day=15/"), and the ability to specify a schema for the partition keys.

Note:

  • The partition keys need to be explicitly included in the columns keyword when you want to include them in the result while reading a subset of the columns.
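A sketch combining both points, assuming a dataset laid out like the year/month example above (the column and partition names are illustrative):

# Prune row groups with a filter; 'year' is a partition key and must be
# listed explicitly in `columns` to appear in the result
table = pq.read_table(
    'dataset_name',
    columns=['one', 'year'],
    filters=[('year', '=', 2007), ('one', '>', 0)],
)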

Using with Spark#

Spark places some constraints on the types of Parquet files it will read. The option flavor='spark' will set these options automatically and also sanitize field characters unsupported by Spark SQL.
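For example (the output file name is illustrative):

# Write a Parquet file with Spark-compatible settings
pq.write_table(table, 'example_spark.parquet', flavor='spark')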

Multithreaded Reads#

Each of the reading functions uses multi-threading by default to read columns in parallel. Depending on the speed of IO and how expensive it is to decode the columns in a particular file (particularly with GZIP compression), this can yield significantly higher data throughput.

This can be disabled by specifying use_threads=False.

Note

The number of threads to use concurrently is automatically inferred by Arrow and can be inspected using the cpu_count() function.
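For example, reusing the example.parquet file from above:

# Inspect the number of threads Arrow will use, then read single-threaded
print(pa.cpu_count())
table = pq.read_table('example.parquet', use_threads=False)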

Reading from cloud storage#

In addition to local files, pyarrow supports other filesystems, such as cloud filesystems, through the filesystem keyword:

from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-2")
table = pq.read_table("bucket/object/key/prefix", filesystem=s3)

Currently, HDFS and Amazon S3-compatible storage are supported. See the Filesystem Interface docs for more details. For those built-in filesystems, the filesystem can also be inferred from the file path, if specified as a URI:

table=pq.read_table("s3://bucket/object/key/prefix")

Other filesystems can still be supported if there is an fsspec-compatible implementation available. See Using fsspec-compatible filesystems with Arrow for more details. One example is Azure Blob storage, which can be interfaced through the adlfs package.

from adlfs import AzureBlobFileSystem

abfs = AzureBlobFileSystem(account_name="XXXX", account_key="XXXX",
                           container_name="XXXX")
table = pq.read_table("file.parquet", filesystem=abfs)

Parquet Modular Encryption (Columnar Encryption)#

Columnar encryption is supported for Parquet files in C++ starting from Apache Arrow 4.0.0 and in PyArrow starting from Apache Arrow 6.0.0.

Parquet uses the envelope encryption practice, where file parts are encrypted with "data encryption keys" (DEKs), and the DEKs are encrypted with "master encryption keys" (MEKs). The DEKs are randomly generated by Parquet for each encrypted file/column. The MEKs are generated, stored and managed in a Key Management Service (KMS) of the user's choice.

Reading and writing encrypted Parquet files involves passing file encryption and decryption properties to ParquetWriter and to ParquetFile, respectively.

Writing an encrypted Parquet file:

encryption_properties = crypto_factory.file_encryption_properties(
    kms_connection_config, encryption_config)
with pq.ParquetWriter(filename, schema,
                      encryption_properties=encryption_properties) as writer:
    writer.write_table(table)

Reading an encrypted Parquet file:

decryption_properties = crypto_factory.file_decryption_properties(
    kms_connection_config)
parquet_file = pq.ParquetFile(filename,
                              decryption_properties=decryption_properties)

In order to create the encryption and decryption properties, a pyarrow.parquet.encryption.CryptoFactory should be created and initialized with KMS Client details, as described below.

KMS Client#

The master encryption keys should be kept and managed in a production-grade Key Management System (KMS), deployed in the user's organization. Using Parquet encryption requires implementation of a client class for the KMS server. Any KmsClient implementation should implement the informal interface defined by pyarrow.parquet.encryption.KmsClient as follows:

import pyarrow.parquet.encryption as pe

class MyKmsClient(pe.KmsClient):
    """An example KmsClient implementation skeleton"""
    def __init__(self, kms_connection_configuration):
        pe.KmsClient.__init__(self)
        # Any KMS-specific initialization based on
        # kms_connection_configuration comes here

    def wrap_key(self, key_bytes, master_key_identifier):
        wrapped_key = ...  # call KMS to wrap key_bytes with key specified by
                           # master_key_identifier
        return wrapped_key

    def unwrap_key(self, wrapped_key, master_key_identifier):
        key_bytes = ...  # call KMS to unwrap wrapped_key with key specified by
                         # master_key_identifier
        return key_bytes

The concrete implementation will be loaded at runtime by a factory function provided by the user. This factory function will be used to initialize the pyarrow.parquet.encryption.CryptoFactory for creating file encryption and decryption properties.

For example, in order to use the MyKmsClient defined above:

def kms_client_factory(kms_connection_configuration):
    return MyKmsClient(kms_connection_configuration)

crypto_factory = pe.CryptoFactory(kms_client_factory)

An example of such a class for an open source KMS can be found in the Apache Arrow GitHub repository. The production KMS client should be designed in cooperation with an organization's security administrators, and built by developers with experience in access control management. Once such a class is created, it can be passed to applications via a factory method and leveraged by general PyArrow users as shown in the encrypted Parquet write/read sample above.

KMS connection configuration#

Configuration of the connection to the KMS (pyarrow.parquet.encryption.KmsConnectionConfig, used when creating file encryption and decryption properties) includes the following options:

  • kms_instance_url, URL of the KMS instance.

  • kms_instance_id, ID of the KMS instance that will be used for encryption (if multiple KMS instances are available).

  • key_access_token, authorization token that will be passed to KMS.

  • custom_kms_conf, a string dictionary with KMS-type-specific configuration.
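A sketch of building such a configuration (all values below are placeholders, not working credentials):

import pyarrow.parquet.encryption as pe

kms_connection_config = pe.KmsConnectionConfig(
    kms_instance_url="https://kms.example.com",  # placeholder URL
    kms_instance_id="DEFAULT",                   # placeholder instance ID
    key_access_token="DEFAULT",                  # placeholder token
    custom_kms_conf={"example_key": "example_value"},
)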

Encryption configuration#

pyarrow.parquet.encryption.EncryptionConfiguration (used when creating file encryption properties) includes the following options:

  • footer_key, the ID of the master key for footer encryption/signing.

  • column_keys, which columns to encrypt with which key. Dictionary with master key IDs as the keys, and column name lists as the values, e.g. {key1: [col1, col2], key2: [col3]}. See notes on nested fields below.

  • encryption_algorithm, the Parquet encryption algorithm. Can be AES_GCM_V1 (default) or AES_GCM_CTR_V1.

  • plaintext_footer, whether to write the file footer in plain text (otherwise it is encrypted).

  • double_wrapping, whether to use double wrapping, where data encryption keys (DEKs) are encrypted with key encryption keys (KEKs), which in turn are encrypted with master encryption keys (MEKs). If set to false, single wrapping is used, where DEKs are encrypted directly with MEKs.

  • cache_lifetime, the lifetime of cached entities (key encryption keys, local wrapping keys, KMS client objects) represented as a datetime.timedelta.

  • internal_key_material, whether to store key material inside Parquet file footers; this mode doesn't produce additional files. If set to false, key material is stored in separate files in the same folder, which enables key rotation for immutable Parquet files.

  • data_key_length_bits, the length of data encryption keys (DEKs), randomly generated by Parquet key management tools. Can be 128, 192 or 256 bits.

Note

When double_wrapping is true, Parquet implements a "double envelope encryption" mode that minimizes the interaction of the program with a KMS server. In this mode, the DEKs are encrypted with "key encryption keys" (KEKs, randomly generated by Parquet). The KEKs are encrypted with "master encryption keys" (MEKs) in the KMS; the result and the KEK itself are cached in the process memory.

An example encryption configuration:

encryption_config = pe.EncryptionConfiguration(
    footer_key="footer_key_name",
    column_keys={
        "column_key_name": ["Column1", "Column2"],
    },
)

Note

Encrypting columns that have nested fields (struct, map or list data types) requires column keys for the inner fields, not the outer column itself. Configuring a column key for the outer column causes this error (here the column name is col):

OSError: Encrypted column col not in file schema

An example encryption configuration for columns with nested fields, where all columns will be encrypted with the same key identified by column_key_id:

import pyarrow.parquet.encryption as pe

schema = pa.schema([
    ("ListColumn", pa.list_(pa.int32())),
    ("MapColumn", pa.map_(pa.string(), pa.int32())),
    ("StructColumn", pa.struct([("f1", pa.int32()), ("f2", pa.string())])),
])

encryption_config = pe.EncryptionConfiguration(
    footer_key="footer_key_name",
    column_keys={
        "column_key_id": [
            "ListColumn.list.element",
            "MapColumn.key_value.key",
            "MapColumn.key_value.value",
            "StructColumn.f1",
            "StructColumn.f2",
        ],
    },
)

Decryption configuration#

pyarrow.parquet.encryption.DecryptionConfiguration (used when creating file decryption properties) is optional and it includes the following options:

  • cache_lifetime, the lifetime of cached entities (key encryption keys, local wrapping keys, KMS client objects) represented as a datetime.timedelta.
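A sketch of passing a decryption configuration (the 10-minute cache lifetime is arbitrary; crypto_factory and kms_connection_config are assumed to be defined as above):

import datetime
import pyarrow.parquet.encryption as pe

decryption_config = pe.DecryptionConfiguration(
    cache_lifetime=datetime.timedelta(minutes=10))
decryption_properties = crypto_factory.file_decryption_properties(
    kms_connection_config, decryption_config)
parquet_file = pq.ParquetFile(filename,
                              decryption_properties=decryption_properties)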

Content-Defined Chunking#

Note

This feature is experimental and may change in future releases.

PyArrow introduces an experimental feature for optimizing Parquet files for content-addressable storage (CAS) systems using content-defined chunking (CDC). This feature enables efficient deduplication of data across files, improving network transfers and storage efficiency.

When enabled, data pages are written according to content-defined chunk boundaries, determined by a rolling hash algorithm that identifies chunk boundaries based on the actual content of the data. When data in a column is modified (e.g., inserted, deleted, or updated), this approach minimizes the number of changed data pages.

The feature can be enabled by setting the use_content_defined_chunking parameter in the Parquet writer. It accepts either a boolean or a dictionary for configuration:

  • True: Uses the default configuration with:
    • Minimum chunk size: 256 KiB

    • Maximum chunk size: 1024 KiB

    • Normalization level: 0

  • dict: Allows customization of the chunking parameters:
    • min_chunk_size: Minimum chunk size in bytes (default: 256 KiB).

    • max_chunk_size: Maximum chunk size in bytes (default: 1024 KiB).

    • norm_level: Normalization level to adjust chunk size distribution (default: 0).

Note that the chunk size is calculated on the logical values before applying any encoding or compression. The actual size of the data pages may vary based on the encoding and compression used.

Note

To make the most of this feature, you should ensure that Parquet write options remain consistent across writes and files. Using different write options (like compression, encoding, or row group size) for different files may prevent proper deduplication and lead to suboptimal storage efficiency.

import pyarrow as pa
import pyarrow.parquet as pq

# df is a pandas DataFrame, e.g. the one defined in the earlier examples
table = pa.Table.from_pandas(df)

# Enable content-defined chunking with default settings
pq.write_table(table, 'example.parquet', use_content_defined_chunking=True)

# Enable content-defined chunking with custom settings
pq.write_table(
    table,
    'example_custom.parquet',
    use_content_defined_chunking={
        'min_chunk_size': 128 * 1024,  # 128 KiB
        'max_chunk_size': 512 * 1024,  # 512 KiB
    }
)