pyarrow.parquet.write_table
- pyarrow.parquet.write_table(table, where, row_group_size=None, version='2.6', use_dictionary=True, compression='snappy', write_statistics=True, use_deprecated_int96_timestamps=None, coerce_timestamps=None, allow_truncated_timestamps=False, data_page_size=None, flavor=None, filesystem=None, compression_level=None, use_byte_stream_split=False, column_encoding=None, data_page_version='1.0', use_compliant_nested_type=True, encryption_properties=None, write_batch_size=None, dictionary_pagesize_limit=None, store_schema=True, write_page_index=False, write_page_checksum=False, sorting_columns=None, store_decimal_as_integer=False, write_time_adjusted_to_utc=False, max_rows_per_page=None, **kwargs)
Write a Table to Parquet format.
- Parameters:
- table : pyarrow.Table
- where : str or pyarrow.NativeFile
- row_group_size : int, default None
  Maximum number of rows in each written row group. If None, the row group size will be the minimum of the Table size (in rows) and 1024 * 1024. If set larger than 64 * 1024 * 1024, then 64 * 1024 * 1024 will be used instead.
- version : {'1.0', '2.4', '2.6'}, default '2.6'
  Determine which Parquet logical types are available for use, whether the reduced set from the Parquet 1.x.x format or the expanded logical types added in later format versions. Files written with version='2.4' or '2.6' may not be readable in all Parquet implementations, so version='1.0' is likely the choice that maximizes file compatibility. UINT32 and some logical types are only available with version '2.4'. Nanosecond timestamps are only available with version '2.6'. Other features such as compression algorithms or the new serialized data page format must be enabled separately (see 'compression' and 'data_page_version').
- use_dictionary : bool or list, default True
  Specify if we should use dictionary encoding in general or only for some columns. When encoding a column, if the dictionary size is too large, the column will fall back to PLAIN encoding. Note that the BOOLEAN type does not support dictionary encoding.
- compression : str or dict, default 'snappy'
  Specify the compression codec, either on a general basis or per-column. Valid values: {'NONE', 'SNAPPY', 'GZIP', 'BROTLI', 'LZ4', 'ZSTD'}.
- write_statistics : bool or list, default True
  Specify if we should write statistics in general (default is True) or only for some columns.
- use_deprecated_int96_timestamps : bool, default None
  Write timestamps to INT96 Parquet format. Defaults to False unless enabled by the flavor argument. This takes priority over the coerce_timestamps option.
- coerce_timestamps : str, default None
  Cast timestamps to a particular resolution. If omitted, defaults are chosen depending on version. For version='1.0' and version='2.4', nanoseconds are cast to microseconds ('us'), while for version='2.6' (the default), they are written natively without loss of resolution. Seconds are always cast to milliseconds ('ms') by default, as Parquet does not have any temporal type with seconds resolution. If the casting results in loss of data, it will raise an exception unless allow_truncated_timestamps=True is given. Valid values: {None, 'ms', 'us'}.
- allow_truncated_timestamps : bool, default False
  Allow loss of data when coercing timestamps to a particular resolution. E.g. if microsecond or nanosecond data is lost when coercing to 'ms', do not raise an exception. Passing allow_truncated_timestamps=True will NOT result in the truncation exception being ignored unless coerce_timestamps is not None.
- data_page_size : int, default None
  Set a target threshold for the approximate encoded size of data pages within a column chunk (in bytes). If None, use the default data page size of 1 MB.
- max_rows_per_page : int, default None
  Maximum number of rows per page within a column chunk. If None, use the default of 20000. Smaller values reduce memory usage during reads but increase metadata overhead.
- flavor : {'spark'}, default None
  Sanitize schema or set other compatibility options to work with various target systems.
- filesystem : FileSystem, default None
  If nothing passed, will be inferred from where if path-like, else where is already a file-like object so no filesystem is needed.
- compression_level : int or dict, default None
  Specify the compression level for a codec, either on a general basis or per-column. If None is passed, Arrow selects the compression level for the compression codec in use. The compression level has a different meaning for each codec, so you have to read the documentation of the codec you are using. An exception is thrown if the compression codec does not allow specifying a compression level.
- use_byte_stream_split : bool or list, default False
  Specify if the byte_stream_split encoding should be used in general or only for some columns. If both dictionary and byte_stream_split encodings are enabled, then dictionary is preferred. The byte_stream_split encoding is valid for integer, floating-point and fixed-size binary data types (including decimals); it should be combined with a compression codec so as to achieve size reduction.
- column_encoding : str or dict, default None
  Specify the encoding scheme on a per column basis. Can only be used when use_dictionary is set to False, and cannot be used in combination with use_byte_stream_split. Currently supported values: {'PLAIN', 'BYTE_STREAM_SPLIT', 'DELTA_BINARY_PACKED', 'DELTA_LENGTH_BYTE_ARRAY', 'DELTA_BYTE_ARRAY'}. Certain encodings are only compatible with certain data types. Please refer to the encodings section of Reading and writing Parquet files.
- data_page_version : {'1.0', '2.0'}, default '1.0'
  The serialized Parquet data page format version to write, defaults to 1.0. This does not impact the file schema logical types and Arrow to Parquet type casting behavior; for that use the 'version' option.
- use_compliant_nested_type : bool, default True
  Whether to write compliant Parquet nested type (lists) as defined by the Parquet specification, defaults to True. For use_compliant_nested_type=True, this will write into a list with 3-level structure where the middle level, named list, is a repeated group with a single field named element:

    <list-repetition> group <name> (LIST) {
        repeated group list {
            <element-repetition> <element-type> element;
        }
    }

  For use_compliant_nested_type=False, this will also write into a list with 3-level structure, where the name of the single field of the middle level list is taken from the element name for nested columns in Arrow, which defaults to item:

    <list-repetition> group <name> (LIST) {
        repeated group list {
            <element-repetition> <element-type> item;
        }
    }
- encryption_properties : FileEncryptionProperties, default None
  File encryption properties for Parquet Modular Encryption. If None, no encryption will be done. The encryption properties can be created using CryptoFactory.file_encryption_properties().
- write_batch_size : int, default None
  Number of values to write to a page at a time. If None, use the default of 1024. write_batch_size is complementary to data_page_size. If pages are exceeding the data_page_size due to large column values, lowering the batch size can help keep page sizes closer to the intended size.
- dictionary_pagesize_limit : int, default None
  Specify the dictionary page size limit per row group. If None, use the default of 1 MB.
- store_schema : bool, default True
  By default, the Arrow schema is serialized and stored in the Parquet file metadata (in the "ARROW:schema" key). When reading the file, if this key is available, it will be used to more faithfully recreate the original Arrow data. For example, for tz-aware timestamp columns it will restore the timezone (Parquet only stores the UTC values without timezone), or columns with duration type will be restored from the int64 Parquet column.
- write_page_index : bool, default False
  Whether to write a page index in general for all columns. Writing statistics to the page index disables the old method of writing statistics to each data page header. The page index makes statistics-based filtering more efficient than the page header, as it gathers all the statistics for a Parquet file in a single place, avoiding scattered I/O. Note that the page index is not yet used on the read side by PyArrow.
- write_page_checksum : bool, default False
  Whether to write page checksums in general for all columns. Page checksums enable detection of data corruption, which might occur during transmission or in storage.
- sorting_columns : Sequence of SortingColumn, default None
  Specify the sort order of the data being written. The writer does not sort the data nor does it verify that the data is sorted. The sort order is written to the row group metadata, which can then be used by readers (see the sketch at the end of the Examples section).
- store_decimal_as_integer : bool, default False
  Allow decimals with 1 <= precision <= 18 to be stored as integers. In Parquet, DECIMAL can be stored in any of the following physical types:
  - int32: for 1 <= precision <= 9.
  - int64: for 10 <= precision <= 18.
  - fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits.
  - binary: precision is unlimited. The minimum number of bytes to store the unscaled value is used.

  By default, this is DISABLED and all decimal types annotate fixed_len_byte_array. When enabled, the writer will use the following physical types to store decimals:
  - int32: for 1 <= precision <= 9.
  - int64: for 10 <= precision <= 18.
  - fixed_len_byte_array: for precision > 18.

  As a consequence, decimal columns stored in integer types are more compact.
- use_content_defined_chunking : bool or dict, default False
  Optimize Parquet files for content addressable storage (CAS) systems by writing data pages according to content-defined chunk boundaries. This allows for more efficient deduplication of data across files, hence more efficient network transfers and storage. The chunking is based on a rolling hash algorithm that identifies chunk boundaries based on the actual content of the data.
  Note that this is an experimental feature and the API may change in the future.
  If set to True, a default configuration is used with min_chunk_size=256 KiB and max_chunk_size=1024 KiB. The chunk size distribution approximates a normal distribution between min_chunk_size and max_chunk_size (sizes are accounted before any Parquet encodings). A dict can be passed to adjust the chunker parameters with the following keys:
  - min_chunk_size: minimum chunk size in bytes, default 256 KiB. The rolling hash will not be updated until this size is reached for each chunk. Note that all data sent through the hash function is counted towards the chunk size, including definition and repetition levels if present.
  - max_chunk_size: maximum chunk size in bytes, default 1024 KiB. The chunker will create a new chunk whenever the chunk size exceeds this value. Note that the Parquet writer has a related data_page_size property that controls the maximum size of a Parquet data page after encoding. While setting data_page_size to a smaller value than max_chunk_size doesn't affect the chunking effectiveness, it results in more small Parquet data pages.
  - norm_level: normalization level to center the chunk size around the average size more aggressively, default 0. Increasing the normalization level increases the probability of finding a chunk, improving the deduplication ratio, but it also increases the number of small chunks, resulting in many small Parquet data pages. The default value provides a good balance between deduplication ratio and fragmentation. Use norm_level=1 or norm_level=2 to reach a higher deduplication ratio at the expense of fragmentation.
- write_time_adjusted_to_utc : bool, default False
  Set the value of isAdjustedToUTC when writing a TIME column. If True, this tells the Parquet reader that the TIME columns are expressed in reference to midnight in the UTC timezone. If False (the default), the TIME columns are assumed to be expressed in reference to midnight in an unknown, presumably local, timezone.
- **kwargs : optional
  Additional options for ParquetWriter.
Examples
Generate an example PyArrow Table:
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
and write the Table into a Parquet file:
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
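As an additional sketch beyond the original examples, a filesystem can be passed explicitly; the LocalFileSystem and the '/tmp/example.parquet' path are illustrative assumptions:

>>> from pyarrow import fs
>>> local = fs.LocalFileSystem()  # any pyarrow FileSystem implementation works here
>>> pq.write_table(table, '/tmp/example.parquet', filesystem=local)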
Defining row group size for the Parquet file:
>>> pq.write_table(table, 'example.parquet', row_group_size=3)
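A further sketch showing how data_page_size and write_batch_size can be tuned together to keep encoded pages near a target size (the 64 KiB target and batch size of 256 are arbitrary illustrative values, not defaults):

>>> pq.write_table(table, 'example.parquet',
...                data_page_size=64 * 1024,  # target ~64 KiB encoded data pages
...                write_batch_size=256)      # write fewer values per batch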
Defining row group compression (default is Snappy):
>>> pq.write_table(table, 'example.parquet', compression='none')
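A related sketch, assuming your PyArrow build includes ZSTD support: the per-codec level can be chosen with compression_level:

>>> pq.write_table(table, 'example.parquet',
...                compression='zstd',
...                compression_level=5)  # the meaning of the level depends on the codec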
Defining row group compression and encoding per-column:
>>> pq.write_table(table, 'example.parquet',
...                compression={'n_legs': 'snappy', 'animal': 'gzip'},
...                use_dictionary=['n_legs', 'animal'])
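The following sketch illustrates timestamp coercion on a hypothetical table with a nanosecond-resolution column; nanoseconds are truncated to microseconds without raising an exception:

>>> ts_table = pa.table({'ts': pa.array([1, 2, 3], type=pa.timestamp('ns'))})
>>> pq.write_table(ts_table, 'timestamps.parquet',
...                coerce_timestamps='us',
...                allow_truncated_timestamps=True)  # do not raise on lost precision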
Defining column encoding per-column:
>>> pq.write_table(table, 'example.parquet',
...                column_encoding={'animal': 'PLAIN'},
...                use_dictionary=False)
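Finally, a sketch of recording sort order metadata together with the page index; this assumes the data is already sorted by 'n_legs', since the writer neither sorts nor verifies it:

>>> sorted_table = table.sort_by([('n_legs', 'ascending')])
>>> pq.write_table(sorted_table, 'example.parquet',
...                sorting_columns=[pq.SortingColumn(0)],  # column index 0 is 'n_legs'
...                write_page_index=True)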

