pyarrow.parquet.read_pandas
- pyarrow.parquet.read_pandas(source, columns=None, **kwargs)
Read a Table from Parquet format, also reading DataFrame index values if known in the file metadata.
- Parameters:
- source
str, list of str, pyarrow.NativeFile, or file-like object. If a string is passed, can be a single file name or directory name. If a list of strings is passed, should be file names. For file-like objects, only read a single file. Use pyarrow.BufferReader to read a file contained in a bytes or buffer-like object.
- columns
list. If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. 'a' will select 'a.b', 'a.c', and 'a.d.e'. If empty, no columns will be read. Note that the table will still have the correct num_rows set despite having no columns.
- use_threads
bool, default True. Perform multi-threaded column reads.
- schema
Schema, optional. Optionally provide the Schema for the parquet dataset, in which case it will not be inferred from the source.
- read_dictionary
list, default None. List of names or column paths (for nested types) to read directly as DictionaryArray. Only supported for BYTE_ARRAY storage. To read a flat column as dictionary-encoded pass the column name. For nested types, you must pass the full column "path", which could be something like level1.level2.list.item. Refer to the Parquet file's schema to obtain the paths.
- binary_type
pyarrow.DataType, default None. If given, Parquet binary columns will be read as this datatype. This setting is ignored if a serialized Arrow schema is found in the Parquet metadata.
- list_type
subclass of pyarrow.DataType, default None. If given, non-MAP repeated columns will be read as an instance of this datatype (either pyarrow.ListType or pyarrow.LargeListType). This setting is ignored if a serialized Arrow schema is found in the Parquet metadata.
- memory_map
bool, default False. If the source is a file path, use a memory map to read the file, which can improve performance in some environments.
- buffer_size
int, default 0. If positive, perform read buffering when deserializing individual column chunks. Otherwise IO calls are unbuffered.
- partitioning
pyarrow.dataset.Partitioning or str or list of str, default "hive". The partitioning scheme for a partitioned dataset. The default of "hive" assumes directory names with key=value pairs like "/year=2009/month=11". In addition, a scheme like "/2009/11" is also supported, in which case you need to specify the field names or a full schema. See the pyarrow.dataset.partitioning() function for more details.
- **kwargs
Additional options for read_table().
- filesystem
FileSystem, default None. If nothing passed, will be inferred based on path. Path will try to be found in the local on-disk filesystem, otherwise it will be parsed as a URI to determine the filesystem.
- filters
pyarrow.compute.Expression or List[Tuple] or List[List[Tuple]], default None. Rows which do not match the filter predicate will be removed from scanned data. Partition keys embedded in a nested directory structure will be exploited to avoid loading files at all if they contain no matching rows. Within-file level filtering and different partitioning schemes are supported. (A runnable sketch combining filters with a hive-partitioned dataset follows the Returns section below.)
Predicates are expressed using an Expression or using the disjunctive normal form (DNF), like [[('x', '=', 0), ...], ...]. DNF allows arbitrary boolean logical combinations of single column predicates. The innermost tuples each describe a single column predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective and multiple column predicate. Finally, the outermost list combines these filters as a disjunction (OR). Predicates may also be passed as List[Tuple]. This form is interpreted as a single conjunction. To express OR in predicates, one must use the (preferred) List[List[Tuple]] notation.
Each tuple has format (key, op, value) and compares the key with the value. The supported op are: = or ==, !=, <, >, <=, >=, in and not in. If the op is in or not in, the value must be a collection such as a list, a set or a tuple.
Examples:
Using the Expression API:
```python
import pyarrow.compute as pc
pc.field('x') == 0
pc.field('y').isin(['a', 'b', 'c'])
~pc.field('y').isin({'a', 'b'})
```
Using the DNF format:
```python
('x', '=', 0)
('y', 'in', ['a', 'b', 'c'])
('z', 'not in', {'a', 'b'})
```
- ignore_prefixes
list, optional. Files matching any of these prefixes will be ignored by the discovery process. This is matched to the basename of a path. By default this is ['.', '_']. Note that discovery happens only if a directory is passed as source.
- pre_buffer
bool, default True. Coalesce and issue file reads in parallel to improve performance on high-latency filesystems (e.g. S3). If True, Arrow will use a background I/O thread pool. If using a filesystem layer that itself performs readahead (e.g. fsspec's S3FS), disable readahead for best results.
- coerce_int96_timestamp_unit
str, default None. Cast timestamps that are stored in INT96 format to a particular resolution (e.g. 'ms'). Setting to None is equivalent to 'ns' and therefore INT96 timestamps will be inferred as timestamps in nanoseconds.
- decryption_properties
FileDecryptionProperties or None. File-level decryption properties. The decryption properties can be created using CryptoFactory.file_decryption_properties().
- thrift_string_size_limit
int, default None. If not None, override the maximum total string size allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.
- thrift_container_size_limit
int, default None. If not None, override the maximum total size of containers allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.
- page_checksum_verification
bool, default False. If True, verify the checksum for each page read from the file.
- arrow_extensions_enabled
bool, default True. If True, read Parquet logical types as Arrow extension types where possible (e.g., read JSON as the canonical arrow.json extension type or UUID as the canonical arrow.uuid extension type).
- Returns:
pyarrow.Table. Content of the file as a Table of Columns, including DataFrame indexes as columns.
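- Examples:
A minimal round-trip sketch, not taken from the original reference: it assumes pandas is installed and uses the placeholder file name example.parquet. A pandas DataFrame is written to Parquet, then read back with read_pandas() so the index stored in the file metadata is restored.
```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A small DataFrame with a named index; example.parquet is a placeholder path.
df = pd.DataFrame({"x": [0, 1, 2], "y": ["a", "b", "c"]},
                  index=pd.Index([10, 20, 30], name="id"))

# Table.from_pandas() stores the index in the file's pandas metadata.
pq.write_table(pa.Table.from_pandas(df), "example.parquet")

# read_pandas() also reads the index columns, so to_pandas() restores the index.
table = pq.read_pandas("example.parquet")
restored = table.to_pandas()
assert restored.index.name == "id"
```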

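A second illustrative sketch for the columns parameter: a prefix of a nested field selects all of its children, and an empty list still yields the correct num_rows. The struct layout and the placeholder file name nested.parquet are assumptions made for the example.
```python
import pyarrow as pa
import pyarrow.parquet as pq

# A table with a nested struct column 'a' (children 'b' and 'c') and a flat column 'x'.
table = pa.table({
    "a": [{"b": 1, "c": "one"}, {"b": 2, "c": "two"}],
    "x": [10, 20],
})
pq.write_table(table, "nested.parquet")

# The prefix 'a' selects the nested fields 'a.b' and 'a.c'.
subset = pq.read_pandas("nested.parquet", columns=["a"])
print(subset.column_names)  # ['a']

# An empty column list reads no columns, but num_rows is still set correctly.
empty = pq.read_pandas("nested.parquet", columns=[])
print(empty.num_rows)       # 2
```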

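Finally, a sketch combining the partitioning and filters parameters (the directory dataset_root and the year/month columns are placeholders): the dataset is written with hive-style key=value directories, and the same row selection is expressed once as an Expression and once in DNF.
```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

table = pa.table({
    "year": [2009, 2009, 2010],
    "month": [11, 12, 1],
    "value": [1.0, 2.0, 3.0],
})

# Hive-style layout: dataset_root/year=2009/month=11/<file>.parquet, etc.
pq.write_to_dataset(table, root_path="dataset_root",
                    partition_cols=["year", "month"])

# Expression form: partition pruning skips directories with no matching rows.
expr = (pc.field("year") == 2009) & (pc.field("month") == 11)
t1 = pq.read_pandas("dataset_root", filters=expr)

# Equivalent DNF form: the inner list is an AND of single-column predicates.
t2 = pq.read_pandas("dataset_root",
                    filters=[[("year", "=", 2009), ("month", "=", 11)]])

assert t1.num_rows == t2.num_rows == 1
```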