pyarrow.parquet.ParquetFile
- class pyarrow.parquet.ParquetFile(source, *, metadata=None, common_metadata=None, read_dictionary=None, binary_type=None, list_type=None, memory_map=False, buffer_size=0, pre_buffer=False, coerce_int96_timestamp_unit=None, decryption_properties=None, thrift_string_size_limit=None, thrift_container_size_limit=None, filesystem=None, page_checksum_verification=False, arrow_extensions_enabled=True)
Bases: object
Reader interface for a single Parquet file.
- Parameters:
- source : str, pathlib.Path, pyarrow.NativeFile, or file-like object
  Readable source. For passing bytes or a buffer-like file containing a Parquet file, use pyarrow.BufferReader.
- metadata : FileMetaData, default None
  Use existing metadata object, rather than reading from file.
- common_metadata : FileMetaData, default None
  Will be used in reads for pandas schema metadata if not found in the main file’s metadata, no other uses at the moment.
- read_dictionary : list
  List of column names to read directly as DictionaryArray.
- binary_type : pyarrow.DataType, default None
  If given, Parquet binary columns will be read as this datatype. This setting is ignored if a serialized Arrow schema is found in the Parquet metadata.
- list_type : subclass of pyarrow.DataType, default None
  If given, non-MAP repeated columns will be read as an instance of this datatype (either pyarrow.ListType or pyarrow.LargeListType). This setting is ignored if a serialized Arrow schema is found in the Parquet metadata.
- memory_map : bool, default False
  If the source is a file path, use a memory map to read the file, which can improve performance in some environments.
- buffer_size : int, default 0
  If positive, perform read buffering when deserializing individual column chunks. Otherwise IO calls are unbuffered.
- pre_buffer : bool, default False
  Coalesce and issue file reads in parallel to improve performance on high-latency filesystems (e.g. S3). If True, Arrow will use a background I/O thread pool.
- coerce_int96_timestamp_unit : str, default None
  Cast timestamps that are stored in INT96 format to a particular resolution (e.g. ‘ms’). Setting to None is equivalent to ‘ns’ and therefore INT96 timestamps will be inferred as timestamps in nanoseconds.
- decryption_properties : FileDecryptionProperties, default None
  File decryption properties for Parquet Modular Encryption.
- thrift_string_size_limit : int, default None
  If not None, override the maximum total string size allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.
- thrift_container_size_limit : int, default None
  If not None, override the maximum total size of containers allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.
- filesystem : FileSystem, default None
  If nothing passed, will be inferred based on path. Path will try to be found in the local on-disk filesystem otherwise it will be parsed as a URI to determine the filesystem.
- page_checksum_verification : bool, default False
  If True, verify the checksum for each page read from the file.
- arrow_extensions_enabled : bool, default True
  If True, read Parquet logical types as Arrow extension types where possible (e.g., read JSON as the canonical arrow.json extension type or UUID as the canonical arrow.uuid extension type).
Examples
Generate an example PyArrow Table and write it to a Parquet file:
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
Create a ParquetFile object from the Parquet file:
>>> parquet_file = pq.ParquetFile('example.parquet')
Read the data:
>>> parquet_file.read()
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[2,2,4,4,5,100]]
animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]
Create a ParquetFile object with “animal” column as DictionaryArray:
>>> parquet_file = pq.ParquetFile('example.parquet',
...                               read_dictionary=["animal"])
>>> parquet_file.read()
pyarrow.Table
n_legs: int64
animal: dictionary<values=string, indices=int32, ordered=0>
----
n_legs: [[2,2,4,4,5,100]]
animal: [  -- dictionary:
["Flamingo","Parrot",...,"Brittle stars","Centipede"]  -- indices:
[0,1,2,3,4,5]]
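The optional constructor arguments described above can be combined as needed. A minimal sketch follows; the option values are illustrative rather than recommendations, and recent PyArrow versions also allow ParquetFile to be used as a context manager so the underlying handle is closed automatically:

import pyarrow.parquet as pq

# Buffer column-chunk reads, coalesce reads for high-latency filesystems,
# and memory-map the local file.
pf = pq.ParquetFile(
    'example.parquet',
    buffer_size=64 * 1024,
    pre_buffer=True,
    memory_map=True,
)
table = pf.read()

# Context-manager form (closes the file handle when the block exits).
with pq.ParquetFile('example.parquet') as pf:
    table = pf.read()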
- __init__(source, *, metadata=None, common_metadata=None, read_dictionary=None, binary_type=None, list_type=None, memory_map=False, buffer_size=0, pre_buffer=False, coerce_int96_timestamp_unit=None, decryption_properties=None, thrift_string_size_limit=None, thrift_container_size_limit=None, filesystem=None, page_checksum_verification=False, arrow_extensions_enabled=True)
Methods
- __init__(source, *[, metadata, ...])
- close([force])
- iter_batches([batch_size, row_groups, ...]): Read streaming batches from a Parquet file.
- read([columns, use_threads, use_pandas_metadata]): Read a Table from Parquet format.
- read_row_group(i[, columns, use_threads, ...]): Read a single row group from a Parquet file.
- read_row_groups(row_groups[, columns, ...]): Read multiple row groups from a Parquet file.
- scan_contents([columns, batch_size]): Read contents of file for the given columns and batch size.
Attributes
- metadata: Return the Parquet metadata.
- num_row_groups: Return the number of row groups of the Parquet file.
- schema: Return the Parquet schema, unconverted to Arrow types.
- schema_arrow: Return the inferred Arrow schema, converted from the whole Parquet file's schema.
- iter_batches(batch_size=65536, row_groups=None, columns=None, use_threads=True, use_pandas_metadata=False)
Read streaming batches from a Parquet file.
- Parameters:
- batch_size : int, default 64K
  Maximum number of records to yield per batch. Batches may be smaller if there aren’t enough rows in the file.
- row_groups : list
  Only these row groups will be read from the file.
- columns : list
  If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.
- use_threads : bool, default True
  Perform multi-threaded column reads.
- use_pandas_metadata : bool, default False
  If True and file has custom pandas schema metadata, ensure that index columns are also loaded.
- Yields:
pyarrow.RecordBatch
  Contents of each batch as a record batch.
Examples
Generate an example Parquet file:
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')
>>> for i in parquet_file.iter_batches():
...     print("RecordBatch")
...     print(i.to_pandas())
...
RecordBatch
   n_legs         animal
0       2       Flamingo
1       2         Parrot
2       4            Dog
3       4          Horse
4       5  Brittle stars
5     100      Centipede
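As a sketch, batch_size and columns can be combined to stream a projection of the file in fixed-size chunks (the values below are illustrative):

# Stream only the "animal" column, two rows at a time.
for batch in parquet_file.iter_batches(batch_size=2, columns=["animal"]):
    print(batch.num_rows, batch.schema.names)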
- property metadata
Return the Parquet metadata.
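As an illustrative sketch, a few commonly used fields of the returned FileMetaData object (continuing the example file used elsewhere on this page):

md = parquet_file.metadata
print(md.num_rows)               # total number of rows in the file
print(md.num_row_groups)         # number of row groups
print(md.row_group(0).num_rows)  # rows in the first row group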
- property num_row_groups
Return the number of row groups of the Parquet file.
Examples
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')
>>> parquet_file.num_row_groups
1
- read(columns=None, use_threads=True, use_pandas_metadata=False)
Read a Table from Parquet format.
- Parameters:
- columns : list
  If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.
- use_threads : bool, default True
  Perform multi-threaded column reads.
- use_pandas_metadata : bool, default False
  If True and file has custom pandas schema metadata, ensure that index columns are also loaded.
- Returns:
pyarrow.table.Table
  Content of the file as a table (of columns).
Examples
Generate an example Parquet file:
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')
Read a Table:
>>> parquet_file.read(columns=["animal"])
pyarrow.Table
animal: string
----
animal: [["Flamingo","Parrot",...,"Brittle stars","Centipede"]]
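As a sketch, the column projection can also be converted to pandas after reading (assuming pandas is installed):

# Read only the "n_legs" column and convert the resulting Table to a DataFrame.
df = parquet_file.read(columns=["n_legs"]).to_pandas()
print(df)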
- read_row_group(i, columns=None, use_threads=True, use_pandas_metadata=False)
Read a single row group from a Parquet file.
- Parameters:
- i : int
  Index of the individual row group that we want to read.
- columns : list
  If not None, only these columns will be read from the row group. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.
- use_threads : bool, default True
  Perform multi-threaded column reads.
- use_pandas_metadata : bool, default False
  If True and file has custom pandas schema metadata, ensure that index columns are also loaded.
- Returns:
pyarrow.table.Table
  Content of the row group as a table (of columns).
Examples
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')
>>> parquet_file.read_row_group(0)
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[2,2,4,4,5,100]]
animal: [["Flamingo","Parrot",...,"Brittle stars","Centipede"]]
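A minimal sketch combining the row-group index with a column selection (the example file above has a single row group, so 0 is the only valid index):

# Read just the "animal" column from the first row group.
subset = parquet_file.read_row_group(0, columns=["animal"])
print(subset.num_rows, subset.column_names)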
- read_row_groups(row_groups, columns=None, use_threads=True, use_pandas_metadata=False)
Read multiple row groups from a Parquet file.
- Parameters:
- row_groups : list
  Only these row groups will be read from the file.
- columns : list
  If not None, only these columns will be read from the row group. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.
- use_threads : bool, default True
  Perform multi-threaded column reads.
- use_pandas_metadata : bool, default False
  If True and file has custom pandas schema metadata, ensure that index columns are also loaded.
- Returns:
pyarrow.table.Table
  Content of the row groups as a table (of columns).
Examples
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')
>>> parquet_file.read_row_groups([0, 0])
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[2,2,4,4,5,...,2,4,4,5,100]]
animal: [["Flamingo","Parrot","Dog",...,"Brittle stars","Centipede"]]
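As a sketch, the row-group indices can be derived from the file metadata before reading, e.g. keeping only non-empty row groups (a trivial filter here, since the example file has a single row group):

# Pick row-group indices from the metadata, then read them in one call.
md = parquet_file.metadata
wanted = [i for i in range(md.num_row_groups) if md.row_group(i).num_rows > 0]
table = parquet_file.read_row_groups(wanted)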
- scan_contents(columns=None, batch_size=65536)
Read contents of file for the given columns and batch size.
- Parameters:
  - columns : list, default None
  - batch_size : int, default 65536
- Returns:
  - num_rows : int
    Number of rows in file.
Notes
This function’s primary purpose is benchmarking. The scan is executed on a single thread.
Examples
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')
>>> parquet_file.scan_contents()
6
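Since the method is intended for benchmarking, a minimal timing sketch (the timing code is illustrative and not part of the API):

import time

start = time.perf_counter()
n_rows = parquet_file.scan_contents(batch_size=1024)
print(n_rows, "rows scanned in", time.perf_counter() - start, "seconds")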
- property schema
Return the Parquet schema, unconverted to Arrow types.
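As a sketch, the Parquet-level schema can be converted to an Arrow schema explicitly via its to_arrow_schema() method, which should match the schema_arrow property described below:

pq_schema = parquet_file.schema              # Parquet-level schema
arrow_schema = pq_schema.to_arrow_schema()   # converted Arrow schema
print(arrow_schema)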
- property schema_arrow
Return the inferred Arrow schema, converted from the whole Parquet file’s schema.
Examples
Generate an example Parquet file:
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')
Read the Arrow schema:
>>> parquet_file.schema_arrow
n_legs: int64
animal: string

