pyarrow.parquet.write_table
- pyarrow.parquet.write_table(table, where, row_group_size=None, version='2.6', use_dictionary=True, compression='snappy', write_statistics=True, use_deprecated_int96_timestamps=None, coerce_timestamps=None, allow_truncated_timestamps=False, data_page_size=None, flavor=None, filesystem=None, compression_level=None, use_byte_stream_split=False, column_encoding=None, data_page_version='1.0', use_compliant_nested_type=True, encryption_properties=None, write_batch_size=None, dictionary_pagesize_limit=None, store_schema=True, write_page_index=False, write_page_checksum=False, sorting_columns=None, store_decimal_as_integer=False, write_time_adjusted_to_utc=False, max_rows_per_page=None, **kwargs)
Write a Table to Parquet format.
- Parameters:
- table : pyarrow.Table
- where : str or pyarrow.NativeFile
- row_group_size : int, default None
  Maximum number of rows in each written row group. If None, the row group size will be the minimum of the Table size (in rows) and 1024 * 1024. If set larger than 64 * 1024 * 1024, then 64 * 1024 * 1024 will be used instead.
- version : {'1.0', '2.4', '2.6'}, default '2.6'
  Determine which Parquet logical types are available for use, whether the reduced set from the Parquet 1.x.x format or the expanded logical types added in later format versions. Files written with version='2.4' or '2.6' may not be readable in all Parquet implementations, so version='1.0' is likely the choice that maximizes file compatibility. UINT32 and some logical types are only available with version '2.4'. Nanosecond timestamps are only available with version '2.6'. Other features such as compression algorithms or the new serialized data page format must be enabled separately (see 'compression' and 'data_page_version').
- use_dictionary : bool or list, default True
  Specify if we should use dictionary encoding in general or only for some columns. When encoding a column, if the dictionary size is too large, the column will fall back to PLAIN encoding. Note that the BOOLEAN type does not support dictionary encoding.
- compression : str or dict, default 'snappy'
  Specify the compression codec, either on a general basis or per-column. Valid values: {'NONE', 'SNAPPY', 'GZIP', 'BROTLI', 'LZ4', 'ZSTD'}.
- write_statistics : bool or list, default True
  Specify if we should write statistics in general (default is True) or only for some columns.
- use_deprecated_int96_timestamps : bool, default None
  Write timestamps to INT96 Parquet format. Defaults to False unless enabled by the flavor argument. This takes priority over the coerce_timestamps option.
- coerce_timestamps : str, default None
  Cast timestamps to a particular resolution. If omitted, defaults are chosen depending on version. For version='1.0' and version='2.4', nanoseconds are cast to microseconds ('us'), while for version='2.6' (the default), they are written natively without loss of resolution. Seconds are always cast to milliseconds ('ms') by default, as Parquet does not have any temporal type with seconds resolution. If the casting results in loss of data, it will raise an exception unless allow_truncated_timestamps=True is given. Valid values: {None, 'ms', 'us'}.
- allow_truncated_timestamps : bool, default False
  Allow loss of data when coercing timestamps to a particular resolution. E.g. if microsecond or nanosecond data is lost when coercing to 'ms', do not raise an exception. Passing allow_truncated_timestamps=True will NOT result in the truncation exception being ignored unless coerce_timestamps is not None.
- data_page_size : int, default None
  Set a target threshold for the approximate encoded size of data pages within a column chunk (in bytes). If None, use the default data page size of 1 MB.
- max_rows_per_page : int, default None
  Maximum number of rows per page within a column chunk. If None, use the default of 20000. Smaller values reduce memory usage during reads but increase metadata overhead.
- flavor : {'spark'}, default None
  Sanitize schema or set other compatibility options to work with various target systems.
- filesystem : FileSystem, default None
  If nothing passed, will be inferred from where if path-like, else where is already a file-like object so no filesystem is needed.
- compression_level : int or dict, default None
  Specify the compression level for a codec, either on a general basis or per-column. If None is passed, Arrow selects the compression level for the compression codec in use. The compression level has a different meaning for each codec, so you have to read the documentation of the codec you are using. An exception is thrown if the compression codec does not allow specifying a compression level.
- use_byte_stream_split : bool or list, default False
  Specify if the byte_stream_split encoding should be used in general or only for some columns. If both dictionary and byte_stream_split encodings are enabled, then dictionary is preferred. The byte_stream_split encoding is valid for integer, floating-point and fixed-size binary data types (including decimals); it should be combined with a compression codec so as to achieve size reduction.
- column_encoding : str or dict, default None
  Specify the encoding scheme on a per column basis. Can only be used when use_dictionary is set to False, and cannot be used in combination with use_byte_stream_split. Currently supported values: {'PLAIN', 'BYTE_STREAM_SPLIT', 'DELTA_BINARY_PACKED', 'DELTA_LENGTH_BYTE_ARRAY', 'DELTA_BYTE_ARRAY'}. Certain encodings are only compatible with certain data types. Please refer to the encodings section of Reading and writing Parquet files.
- data_page_version : {'1.0', '2.0'}, default '1.0'
  The serialized Parquet data page format version to write, defaults to 1.0. This does not impact the file schema logical types and Arrow to Parquet type casting behavior; for that use the 'version' option.
- use_compliant_nested_type : bool, default True
  Whether to write compliant Parquet nested type (lists) as defined by the Parquet specification, defaults to True. For use_compliant_nested_type=True, this will write into a list with 3-level structure where the middle level, named list, is a repeated group with a single field named element:

    <list-repetition> group <name> (LIST) {
        repeated group list {
            <element-repetition> <element-type> element;
        }
    }

  For use_compliant_nested_type=False, this will also write into a list with 3-level structure, where the name of the single field of the middle level list is taken from the element name for nested columns in Arrow, which defaults to item:

    <list-repetition> group <name> (LIST) {
        repeated group list {
            <element-repetition> <element-type> item;
        }
    }
- encryption_properties : FileEncryptionProperties, default None
  File encryption properties for Parquet Modular Encryption. If None, no encryption will be done. The encryption properties can be created using CryptoFactory.file_encryption_properties().
- write_batch_size : int, default None
  Number of values to write to a page at a time. If None, use the default of 1024. write_batch_size is complementary to data_page_size. If pages are exceeding the data_page_size due to large column values, lowering the batch size can help keep page sizes closer to the intended size.
- dictionary_pagesize_limit : int, default None
  Specify the dictionary page size limit per row group. If None, use the default of 1 MB.
- store_schema : bool, default True
  By default, the Arrow schema is serialized and stored in the Parquet file metadata (in the "ARROW:schema" key). When reading the file, if this key is available, it will be used to more faithfully recreate the original Arrow data. For example, for tz-aware timestamp columns it will restore the timezone (Parquet only stores the UTC values without timezone), or columns with duration type will be restored from the int64 Parquet column.
- write_page_index : bool, default False
  Whether to write a page index in general for all columns. Writing statistics to the page index disables the old method of writing statistics to each data page header. The page index makes statistics-based filtering more efficient than the page header, as it gathers all the statistics for a Parquet file in a single place, avoiding scattered I/O. Note that the page index is not yet used on the read side by PyArrow.
- write_page_checksum : bool, default False
  Whether to write page checksums in general for all columns. Page checksums enable detection of data corruption, which might occur during transmission or in storage.
- sorting_columns : Sequence of SortingColumn, default None
  Specify the sort order of the data being written. The writer does not sort the data nor does it verify that the data is sorted. The sort order is written to the row group metadata, which can then be used by readers (see the sketch at the end of the Examples section).
- store_decimal_as_integer : bool, default False
  Allow decimals with 1 <= precision <= 18 to be stored as integers. In Parquet, DECIMAL can be stored in any of the following physical types:
  - int32: for 1 <= precision <= 9.
  - int64: for 10 <= precision <= 18.
  - fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits.
  - binary: precision is unlimited. The minimum number of bytes to store the unscaled value is used.

  By default, this is DISABLED and all decimal types annotate fixed_len_byte_array. When enabled, the writer will use the following physical types to store decimals:
  - int32: for 1 <= precision <= 9.
  - int64: for 10 <= precision <= 18.
  - fixed_len_byte_array: for precision > 18.

  As a consequence, decimal columns stored in integer types are more compact.
- use_content_defined_chunking : bool or dict, default False
  Optimize Parquet files for content addressable storage (CAS) systems by writing data pages according to content-defined chunk boundaries. This allows for more efficient deduplication of data across files, hence more efficient network transfers and storage. The chunking is based on a rolling hash algorithm that identifies chunk boundaries based on the actual content of the data.
  Note that this is an experimental feature and the API may change in the future.
  If set to True, a default configuration is used with min_chunk_size=256 KiB and max_chunk_size=1024 KiB. The chunk size distribution approximates a normal distribution between min_chunk_size and max_chunk_size (sizes are accounted before any Parquet encodings). A dict can be passed to adjust the chunker parameters with the following keys:
  - min_chunk_size: minimum chunk size in bytes, default 256 KiB. The rolling hash will not be updated until this size is reached for each chunk. Note that all data sent through the hash function is counted towards the chunk size, including definition and repetition levels if present.
  - max_chunk_size: maximum chunk size in bytes, default 1024 KiB. The chunker will create a new chunk whenever the chunk size exceeds this value. Note that the Parquet writer has a related data_page_size property that controls the maximum size of a Parquet data page after encoding. While setting data_page_size to a smaller value than max_chunk_size doesn't affect the chunking effectiveness, it results in more small Parquet data pages.
  - norm_level: normalization level to center the chunk size around the average size more aggressively, default 0. Increasing the normalization level increases the probability of finding a chunk, improving the deduplication ratio, but it also increases the number of small chunks, resulting in many small Parquet data pages. The default value provides a good balance between deduplication ratio and fragmentation. Use norm_level=1 or norm_level=2 to reach a higher deduplication ratio at the expense of fragmentation.
- write_time_adjusted_to_utc : bool, default False
  Set the value of isAdjustedToUTC when writing a TIME column. If True, this tells the Parquet reader that the TIME columns are expressed in reference to midnight in the UTC timezone. If False (the default), the TIME columns are assumed to be expressed in reference to midnight in an unknown, presumably local, timezone.
- **kwargs : optional
  Additional options for ParquetWriter.
Examples
Generate an example PyArrow Table:
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
and write the Table into a Parquet file:
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
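As an additional sketch beyond the original examples, a filesystem can be passed explicitly; the LocalFileSystem and the '/tmp/example.parquet' path are illustrative assumptions:

>>> from pyarrow import fs
>>> local = fs.LocalFileSystem()  # any pyarrow FileSystem implementation works here
>>> pq.write_table(table, '/tmp/example.parquet', filesystem=local)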
Defining row group size for the Parquet file:
>>> pq.write_table(table, 'example.parquet', row_group_size=3)
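A further sketch showing how data_page_size and write_batch_size can be tuned together to keep encoded pages near a target size (the 64 KiB target and batch size of 256 are arbitrary illustrative values, not defaults):

>>> pq.write_table(table, 'example.parquet',
...                data_page_size=64 * 1024,  # target ~64 KiB encoded data pages
...                write_batch_size=256)      # write fewer values per batch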
Defining row group compression (default is Snappy):
>>> pq.write_table(table, 'example.parquet', compression='none')
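A related sketch, assuming your PyArrow build includes ZSTD support: the per-codec level can be chosen with compression_level:

>>> pq.write_table(table, 'example.parquet',
...                compression='zstd',
...                compression_level=5)  # the meaning of the level depends on the codec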
Defining row group compression and encoding per-column:
>>> pq.write_table(table, 'example.parquet',
...                compression={'n_legs': 'snappy', 'animal': 'gzip'},
...                use_dictionary=['n_legs', 'animal'])
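The following sketch illustrates timestamp coercion on a hypothetical table with a nanosecond-resolution column; nanoseconds are truncated to microseconds without raising an exception:

>>> ts_table = pa.table({'ts': pa.array([1, 2, 3], type=pa.timestamp('ns'))})
>>> pq.write_table(ts_table, 'timestamps.parquet',
...                coerce_timestamps='us',
...                allow_truncated_timestamps=True)  # do not raise on lost precision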
Defining column encoding per-column:
>>> pq.write_table(table, 'example.parquet',
...                column_encoding={'animal': 'PLAIN'},
...                use_dictionary=False)
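Finally, a sketch of recording sort order metadata together with the page index; this assumes the data is already sorted by 'n_legs', since the writer neither sorts nor verifies it:

>>> sorted_table = table.sort_by([('n_legs', 'ascending')])
>>> pq.write_table(sorted_table, 'example.parquet',
...                sorting_columns=[pq.SortingColumn(0)],  # column index 0 is 'n_legs'
...                write_page_index=True)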

