pyarrow.parquet.read_table

pyarrow.parquet.read_table(source, *, columns=None, use_threads=True, schema=None, use_pandas_metadata=False, read_dictionary=None, binary_type=None, list_type=None, memory_map=False, buffer_size=0, partitioning='hive', filesystem=None, filters=None, ignore_prefixes=None, pre_buffer=True, coerce_int96_timestamp_unit=None, decryption_properties=None, thrift_string_size_limit=None, thrift_container_size_limit=None, page_checksum_verification=False, arrow_extensions_enabled=True)

Read a Table from Parquet format

Parameters:
source : str, list of str, pyarrow.NativeFile, or file-like object

If a string is passed, it can be a single file name or directory name. If a list of strings is passed, they should be file names. For file-like objects, only a single file is read. Use pyarrow.BufferReader to read a file contained in a bytes or buffer-like object.

columns : list

If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’. If empty, no columns will be read. Note that the table will still have the correct num_rows set despite having no columns.

use_threads : bool, default True

Perform multi-threaded column reads.

schema : Schema, optional

Optionally provide the Schema for the parquet dataset, in which case it will not be inferred from the source.

use_pandas_metadata : bool, default False

If True and the file has custom pandas schema metadata, ensure that index columns are also loaded.

read_dictionary : list, default None

List of names or column paths (for nested types) to read directly as DictionaryArray. Only supported for BYTE_ARRAY storage. To read a flat column as dictionary-encoded, pass the column name. For nested types, you must pass the full column “path”, which could be something like level1.level2.list.item. Refer to the Parquet file’s schema to obtain the paths.

binary_type : pyarrow.DataType, default None

If given, Parquet binary columns will be read as this datatype. This setting is ignored if a serialized Arrow schema is found in the Parquet metadata.

list_type : subclass of pyarrow.DataType, default None

If given, non-MAP repeated columns will be read as an instance of this datatype (either pyarrow.ListType or pyarrow.LargeListType). This setting is ignored if a serialized Arrow schema is found in the Parquet metadata.

memory_map : bool, default False

If the source is a file path, use a memory map to read the file, which can improve performance in some environments.

buffer_size : int, default 0

If positive, perform read buffering when deserializing individual column chunks. Otherwise IO calls are unbuffered.

partitioning : pyarrow.dataset.Partitioning or str or list of str, default “hive”

The partitioning scheme for a partitioned dataset. The default of “hive” assumes directory names with key=value pairs like “/year=2009/month=11”. In addition, a scheme like “/2009/11” is also supported, in which case you need to specify the field names or a full schema. See the pyarrow.dataset.partitioning() function for more details.

filesystem : FileSystem, default None

If nothing is passed, the filesystem will be inferred from the path: the path is first looked up on the local on-disk filesystem, otherwise it is parsed as a URI to determine the filesystem.

filters : pyarrow.compute.Expression or List[Tuple] or List[List[Tuple]], default None

Rows which do not match the filter predicate will be removed from scanned data. Partition keys embedded in a nested directory structure will be exploited to avoid loading files at all if they contain no matching rows. Within-file level filtering and different partitioning schemes are supported.

Predicates are expressed using an Expression or using the disjunctive normal form (DNF), like [[('x', '=', 0), ...], ...]. DNF allows arbitrary boolean logical combinations of single column predicates. The innermost tuples each describe a single column predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective, multiple-column predicate. Finally, the outermost list combines these filters as a disjunction (OR).

Predicates may also be passed as List[Tuple]. This form is interpreted as a single conjunction. To express OR in predicates, one must use the (preferred) List[List[Tuple]] notation.

Each tuple has the format (key, op, value) and compares the key with the value. The supported op values are: = or ==, !=, <, >, <=, >=, in and not in. If the op is in or not in, the value must be a collection such as a list, a set or a tuple.

Examples:

Using the Expression API:

import pyarrow.compute as pc
pc.field('x') == 0
pc.field('y').isin(['a', 'b', 'c'])
~pc.field('y').isin({'a', 'b'})

Using the DNF format:

('x', '=', 0)
('y', 'in', ['a', 'b', 'c'])
('z', 'not in', {'a', 'b'})

ignore_prefixes : list, optional

Files matching any of these prefixes will be ignored by the discovery process. This is matched to the basename of a path. By default this is [‘.’, ‘_’]. Note that discovery happens only if a directory is passed as source.

pre_buffer : bool, default True

Coalesce and issue file reads in parallel to improve performance on high-latency filesystems (e.g. S3). If True, Arrow will use a background I/O thread pool. If using a filesystem layer that itself performs readahead (e.g. fsspec’s S3FS), disable readahead for best results.

coerce_int96_timestamp_unit : str, default None

Cast timestamps that are stored in INT96 format to a particular resolution (e.g. ‘ms’). Setting to None is equivalent to ‘ns’ and therefore INT96 timestamps will be inferred as timestamps in nanoseconds.

decryption_properties : FileDecryptionProperties or None

File-level decryption properties. The decryption properties can be created using CryptoFactory.file_decryption_properties().

thrift_string_size_limit : int, default None

If not None, override the maximum total string size allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.

thrift_container_size_limit : int, default None

If not None, override the maximum total size of containers allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.

page_checksum_verification : bool, default False

If True, verify the checksum for each page read from the file.

arrow_extensions_enabled : bool, default True

If True, read Parquet logical types as Arrow extension types where possible (e.g., read JSON as the canonical arrow.json extension type or UUID as the canonical arrow.uuid extension type).

Returns:
pyarrow.Table

Content of the file as a table (of columns)

Examples

Generate an example PyArrow Table and write it to a partitioned dataset:

>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_to_dataset(table, root_path='dataset_name_2',
...                     partition_cols=['year'])

Read the data:

>>> pq.read_table('dataset_name_2').to_pandas()
   n_legs         animal  year
0       5  Brittle stars  2019
1       2       Flamingo  2020
2       4            Dog  2021
3     100      Centipede  2021
4       2         Parrot  2022
5       4          Horse  2022

Read only a subset of columns:

>>> pq.read_table('dataset_name_2', columns=["n_legs", "animal"])
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[5],[2],[4,100],[2,4]]
animal: [["Brittle stars"],["Flamingo"],["Dog","Centipede"],["Parrot","Horse"]]

Read a subset of columns and read one column as DictionaryArray:

>>> pq.read_table('dataset_name_2', columns=["n_legs", "animal"],
...               read_dictionary=["animal"])
pyarrow.Table
n_legs: int64
animal: dictionary<values=string, indices=int32, ordered=0>
----
n_legs: [[5],[2],[4,100],[2,4]]
animal: [  -- dictionary:
["Brittle stars"]  -- indices:
[0],  -- dictionary:
["Flamingo"]  -- indices:
[0],  -- dictionary:
["Dog","Centipede"]  -- indices:
[0,1],  -- dictionary:
["Parrot","Horse"]  -- indices:
[0,1]]

Read the table with filter:

>>> pq.read_table('dataset_name_2', columns=["n_legs", "animal"],
...               filters=[('n_legs', '<', 4)]).to_pandas()
   n_legs    animal
0       2  Flamingo
1       2    Parrot

Read data from a single Parquet file:

>>> pq.write_table(table, 'example.parquet')
>>> pq.read_table('example.parquet').to_pandas()
   year  n_legs         animal
0  2020       2       Flamingo
1  2022       2         Parrot
2  2021       4            Dog
3  2022       4          Horse
4  2019       5  Brittle stars
5  2021     100      Centipede