pyarrow.parquet.read_pandas

pyarrow.parquet.read_pandas(source, columns=None, **kwargs)

Read a Table from Parquet format, also reading DataFrame index values if known in the file metadata.

Parameters:
source : str, list of str, pyarrow.NativeFile, or file-like object

If a string is passed, can be a single file name or directory name. If a list of strings is passed, should be file names. For file-like objects, only read a single file. Use pyarrow.BufferReader to read a file contained in a bytes or buffer-like object.

columns : list

If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’. If empty, no columns will be read. Note that the table will still have the correct num_rows set despite having no columns.
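
As an illustrative, hedged sketch of prefix selection (the file name "nested.parquet" and the column names are hypothetical):

import pyarrow.parquet as pq

# 'a' is only a prefix here: it selects nested children such as 'a.b'
# and 'a.c', while 'x' names a flat column directly.
table = pq.read_pandas("nested.parquet", columns=["a", "x"])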

use_threads : bool, default True

Perform multi-threaded column reads.

schema : Schema, optional

Optionally provide the Schema for the parquet dataset, in which case it will not be inferred from the source.

read_dictionary : list, default None

List of names or column paths (for nested types) to read directly as DictionaryArray. Only supported for BYTE_ARRAY storage. To read a flat column as dictionary-encoded pass the column name. For nested types, you must pass the full column “path”, which could be something like level1.level2.list.item. Refer to the Parquet file’s schema to obtain the paths.
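
A hedged sketch of dictionary-encoded reads (the file name "example.parquet", the flat column 'city', and the nested path 'tags.list.item' are hypothetical; consult your own file's schema for the real paths):

import pyarrow.parquet as pq

# Read a flat string column and one nested string field as DictionaryArray.
table = pq.read_pandas(
    "example.parquet",
    read_dictionary=["city", "tags.list.item"],
)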

binary_type : pyarrow.DataType, default None

If given, Parquet binary columns will be read as this datatype. This setting is ignored if a serialized Arrow schema is found in the Parquet metadata.

list_type : subclass of pyarrow.DataType, default None

If given, non-MAP repeated columns will be read as an instance of this datatype (either pyarrow.ListType or pyarrow.LargeListType). This setting is ignored if a serialized Arrow schema is found in the Parquet metadata.

memory_map : bool, default False

If the source is a file path, use a memory map to read the file, which can improve performance in some environments.

buffer_size : int, default 0

If positive, perform read buffering when deserializing individual column chunks. Otherwise IO calls are unbuffered.

partitioning : pyarrow.dataset.Partitioning or str or list of str, default “hive”

The partitioning scheme for a partitioned dataset. The default of “hive” assumes directory names with key=value pairs like “/year=2009/month=11”. In addition, a scheme like “/2009/11” is also supported, in which case you need to specify the field names or a full schema. See the pyarrow.dataset.partitioning() function for more details.
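
For example, a hedged sketch for a non-hive layout such as dataset_root/2009/11/ (the directory name and field names are hypothetical):

import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Directory names carry no key=value pairs, so the field names are given
# explicitly; with the default "hive" scheme this would not be needed.
table = pq.read_pandas(
    "dataset_root/",
    partitioning=ds.partitioning(field_names=["year", "month"]),
)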

**kwargs

Additional options for read_table().

filesystem : FileSystem, default None

If nothing is passed, the filesystem will be inferred from the path: the path is first looked up in the local on-disk filesystem, otherwise it is parsed as a URI to determine the filesystem.

filters : pyarrow.compute.Expression or List[Tuple] or List[List[Tuple]], default None

Rows which do not match the filter predicate will be removed from scanned data. Partition keys embedded in a nested directory structure will be exploited to avoid loading files at all if they contain no matching rows. Within-file level filtering and different partitioning schemes are supported.

Predicates are expressed using an Expression or using the disjunctive normal form (DNF), like [[('x', '=', 0), ...], ...]. DNF allows arbitrary boolean logical combinations of single column predicates. The innermost tuples each describe a single column predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective, multiple-column predicate. Finally, the outermost list combines these filters as a disjunction (OR).

Predicates may also be passed as List[Tuple]. This form is interpreted as a single conjunction. To express OR in predicates, one must use the (preferred) List[List[Tuple]] notation.

Each tuple has format: (key, op, value) and compares the key with the value. The supported op are: = or ==, !=, <, >, <=, >=, in and not in. If the op is in or not in, the value must be a collection such as a list, a set or a tuple.

Examples:

Using the Expression API:

import pyarrow.compute as pc
pc.field('x') == 0
pc.field('y').isin(['a', 'b', 'c'])
~pc.field('y').isin({'a', 'b'})

Using the DNF format:

('x', '=', 0)
('y', 'in', ['a', 'b', 'c'])
('z', 'not in', {'a', 'b'})
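
Putting the DNF form together in a call (a hedged sketch; the path and column names are illustrative):

import pyarrow.parquet as pq

table = pq.read_pandas(
    "dataset_root/",
    filters=[
        [("x", "=", 0), ("y", "in", ["a", "b"])],  # inner list: AND
        [("z", "not in", {"c", "d"})],             # outer list: OR
    ],
)
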
ignore_prefixes : list, optional

Files matching any of these prefixes will be ignored by the discovery process. This is matched to the basename of a path. By default this is [‘.’, ‘_’]. Note that discovery happens only if a directory is passed as source.

pre_buffer : bool, default True

Coalesce and issue file reads in parallel to improve performance on high-latency filesystems (e.g. S3). If True, Arrow will use a background I/O thread pool. If using a filesystem layer that itself performs readahead (e.g. fsspec’s S3FS), disable readahead for best results.

coerce_int96_timestamp_unit : str, default None

Cast timestamps that are stored in INT96 format to a particular resolution (e.g. ‘ms’). Setting to None is equivalent to ‘ns’ and therefore INT96 timestamps will be inferred as timestamps in nanoseconds.

decryption_properties : FileDecryptionProperties or None

File-level decryption properties. The decryption properties can be created using CryptoFactory.file_decryption_properties().

thrift_string_size_limit : int, default None

If not None, override the maximum total string size allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.

thrift_container_size_limit : int, default None

If not None, override the maximum total size of containers allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.

page_checksum_verification : bool, default False

If True, verify the checksum for each page read from the file.

arrow_extensions_enabled : bool, default True

If True, read Parquet logical types as Arrow extension types where possible (e.g., read JSON as the canonical arrow.json extension type or UUID as the canonical arrow.uuid extension type).

Returns:
pyarrow.Table

Content of the file as a Table of Columns, including DataFrame indexes as columns.
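
Examples:

A minimal usage sketch; the file name, column names, and index values are illustrative only:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Write a DataFrame whose named index is stored in the file's pandas metadata.
df = pd.DataFrame({"value": [1.0, 2.0, 3.0]},
                  index=pd.Index(["a", "b", "c"], name="key"))
pq.write_table(pa.Table.from_pandas(df, preserve_index=True), "example.parquet")

# read_pandas also reads the stored index column, even though only 'value'
# is requested, because the pandas metadata is found in the file.
table = pq.read_pandas("example.parquet", columns=["value"])
df_roundtrip = table.to_pandas()  # 'key' comes back as the DataFrame index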