pyarrow.dataset.ParquetFileFragment

class pyarrow.dataset.ParquetFileFragment

Bases: FileFragment

A Fragment representing a Parquet file.
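ParquetFileFragment objects are typically obtained from a parquet dataset rather than constructed directly. A minimal sketch (the directory "data/" is hypothetical):

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("data/", format="parquet")  # hypothetical path
>>> fragment = next(dataset.get_fragments())         # a ParquetFileFragment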

__init__(*args, **kwargs)

Methods

__init__(*args, **kwargs)

count_rows(self, Expression filter=None, ...)

Count rows matching the scanner filter.

ensure_complete_metadata(self)

Ensure that all metadata (statistics, physical schema, ...) have been read and cached in this fragment.

head(self, int num_rows[, columns])

Load the first N rows of the fragment.

open(self)

Open a NativeFile of the buffer or file viewed by this fragment.

scanner(self, Schema schema=None[, columns])

Build a scan operation against the fragment.

split_by_row_group(self, ...)

Split the fragment into multiple fragments.

subset(self, Expression filter=None, ...[, ...])

Create a subset of the fragment (viewing a subset of the row groups).

take(self, indices[, columns])

Select rows of data by index.

to_batches(self, Schema schema=None[, columns])

Read the fragment as materialized record batches.

to_table(self, Schema schema=None[, columns])

Convert this Fragment into a Table.

Attributes

buffer

The buffer viewed by this fragment, if it views a buffer.

filesystem

The FileSystem containing the data file viewed by this fragment, if it views a file.

format

The format of the data file viewed by this fragment.

metadata

num_row_groups

Return the number of row groups viewed by this fragment (not the number of row groups in the origin file).

partition_expression

An Expression which evaluates to true for all data viewed by this Fragment.

path

The path of the data file viewed by this fragment, if it views a file.

physical_schema

Return the physical schema of this Fragment.

row_groups

buffer

The buffer viewed by this fragment, if it views a buffer. If instead it views a file, this will be None.

count_rows(self, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)

Count rows matching the scanner filter.

Parameters:
filter : Expression, default None

Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.

batch_size : int, default 131_072

The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this value can be lowered to reduce their size.

batch_readahead : int, default 16

The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_readahead : int, default 4

The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_scan_options : FragmentScanOptions, default None

Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.

use_threads : bool, default True

If enabled, maximum parallelism will be used, as determined by the number of available CPU cores.

cache_metadata : bool, default True

If enabled, metadata may be cached when scanning to speed up repeated scans.

memory_pool : MemoryPool, default None

For memory allocations, if required. If not specified, uses the default pool.

Returns:
count : int
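
For illustration, a minimal sketch continuing the hypothetical fragment from above (the "year" column is an assumption):

>>> import pyarrow.dataset as ds
>>> fragment.count_rows(filter=ds.field("year") == 2020)  # rows where year == 2020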
ensure_complete_metadata(self)

Ensure that all metadata (statistics, physical schema, …) have been read and cached in this fragment.
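
A sketch of the typical use, assuming the fragment from above (reading the metadata up front so that attributes such as metadata, num_row_groups and row_groups are available without further I/O):

>>> fragment.ensure_complete_metadata()
>>> fragment.num_row_groups  # now served from the cached metadata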

filesystem

The FileSystem containing the data file viewed by this fragment, if it views a file. If instead it views a buffer, this will be None.

format

The format of the data file viewed by this fragment.

head(self, int num_rows, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)

Load the first N rows of the fragment.

Parameters:
num_rows : int

The number of rows to load.

columns : list of str, default None

The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.

The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).

The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.

filter : Expression, default None

Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.

batch_size : int, default 131_072

The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this value can be lowered to reduce their size.

batch_readahead : int, default 16

The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_readahead : int, default 4

The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_scan_options : FragmentScanOptions, default None

Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.

use_threads : bool, default True

If enabled, maximum parallelism will be used, as determined by the number of available CPU cores.

cache_metadata : bool, default True

If enabled, metadata may be cached when scanning to speed up repeated scans.

memory_pool : MemoryPool, default None

For memory allocations, if required. If not specified, uses the default pool.

Returns:
Table
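
For example, a minimal sketch (the column name "year" is an assumption):

>>> first_rows = fragment.head(5, columns=["year"])  # Table with at most 5 rows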
metadata

num_row_groups

Return the number of row groups viewed by this fragment (not the number of row groups in the origin file).

open(self)

Open a NativeFile of the buffer or file viewed by this fragment.

partition_expression

An Expression which evaluates to true for all data viewed by this Fragment.

path

The path of the data file viewed by this fragment, if it views a file. If instead it views a buffer, this will be “<Buffer>”.

physical_schema

Return the physical schema of this Fragment. This schema can be different from the dataset read schema.

row_groups

scanner(self, Schema schema=None, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)

Build a scan operation against the fragment.

Data is not loaded immediately. Instead, this produces a Scanner, which exposes further operations (e.g. loading all data as a table, counting rows).

Parameters:
schema : Schema

Schema to use for scanning. This is used to unify a Fragment to its Dataset’s schema. If not specified this will use the Fragment’s physical schema, which might differ for each Fragment.

columns : list of str, default None

The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.

The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).

The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.

filter : Expression, default None

Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.

batch_size : int, default 131_072

The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this value can be lowered to reduce their size.

batch_readahead : int, default 16

The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_readahead : int, default 4

The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_scan_options : FragmentScanOptions, default None

Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.

use_threads : bool, default True

If enabled, maximum parallelism will be used, as determined by the number of available CPU cores.

cache_metadata : bool, default True

If enabled, metadata may be cached when scanning to speed up repeated scans.

memory_pool : MemoryPool, default None

For memory allocations, if required. If not specified, uses the default pool.

Returns:
scanner : Scanner
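
A short sketch of a typical scan pipeline (the "year" column is an assumption):

>>> import pyarrow.dataset as ds
>>> scanner = fragment.scanner(columns=["year"], filter=ds.field("year") == 2020)
>>> table = scanner.to_table()  # or scanner.to_batches(), scanner.count_rows(), ...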
split_by_row_group(self, Expression filter=None, Schema schema=None)

Split the fragment into multiple fragments.

Yield a Fragment wrapping each row group in this ParquetFileFragment. Row groups whose metadata contradicts the optional filter will be excluded.

Parameters:
filter : Expression, default None

Only include the row groups which satisfy this predicate (using the Parquet RowGroup statistics).

schema : Schema, default None

Schema to use when filtering row groups. Defaults to the Fragment’s physical schema.

Returns:
A list of Fragments
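
For illustration (the filter and "year" column are assumptions; without a filter every row group is kept):

>>> per_row_group = fragment.split_by_row_group()  # one fragment per row group
>>> matching = fragment.split_by_row_group(filter=ds.field("year") == 2020)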
subset(self, Expression filter=None, Schema schema=None, row_group_ids=None)

Create a subset of the fragment (viewing a subset of the row groups).

Subset can be specified by either a filter predicate (with optional schema) or by a list of row group IDs. Note that when using a filter, the resulting fragment can be empty (viewing no row groups).

Parameters:
filter : Expression, default None

Only include the row groups which satisfy this predicate (using the Parquet RowGroup statistics).

schema : Schema, default None

Schema to use when filtering row groups. Defaults to the Fragment’s physical schema.

row_group_ids : list of ints

The row group IDs to include in the subset. Can only be specified if filter is None.

Returns:
ParquetFileFragment
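
A hedged sketch of both ways of specifying the subset (the row group IDs and "year" column are assumptions):

>>> sub = fragment.subset(row_group_ids=[0, 2])             # view row groups 0 and 2
>>> sub = fragment.subset(filter=ds.field("year") == 2020)  # may view no row groups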
take(self, indices, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)

Select rows of data by index.

Parameters:
indices : Array or array-like

The indices of rows to select in the dataset.

columns : list of str, default None

The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.

The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).

The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.

filter : Expression, default None

Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.

batch_size : int, default 131_072

The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this value can be lowered to reduce their size.

batch_readahead : int, default 16

The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_readahead : int, default 4

The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_scan_options : FragmentScanOptions, default None

Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.

use_threads : bool, default True

If enabled, maximum parallelism will be used, as determined by the number of available CPU cores.

cache_metadata : bool, default True

If enabled, metadata may be cached when scanning to speed up repeated scans.

memory_pool : MemoryPool, default None

For memory allocations, if required. If not specified, uses the default pool.

Returns:
Table
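
For example, a minimal sketch (the indices are arbitrary and the "year" column is an assumption):

>>> import pyarrow as pa
>>> rows = fragment.take(pa.array([0, 5, 10]), columns=["year"])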
to_batches(self, Schema schema=None, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)

Read the fragment as materialized record batches.

Parameters:
schema : Schema, optional

Concrete schema to use for scanning.

columns : list of str, default None

The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.

The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).

The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.

filter : Expression, default None

Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.

batch_size : int, default 131_072

The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this value can be lowered to reduce their size.

batch_readahead : int, default 16

The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_readahead : int, default 4

The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_scan_options : FragmentScanOptions, default None

Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.

use_threads : bool, default True

If enabled, maximum parallelism will be used, as determined by the number of available CPU cores.

cache_metadata : bool, default True

If enabled, metadata may be cached when scanning to speed up repeated scans.

memory_pool : MemoryPool, default None

For memory allocations, if required. If not specified, uses the default pool.

Returns:
record_batches : iterator of RecordBatch
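
A sketch of streaming through the fragment batch by batch (the "year" column and the consuming function are hypothetical):

>>> for batch in fragment.to_batches(columns=["year"]):
...     consume(batch)  # e.g. aggregate or write out each RecordBatch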
to_table(self, Schema schema=None, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)

Convert this Fragment into a Table.

Use this convenience utility with care. This will serially materialize the Scan result in memory before creating the Table.

Parameters:
schema : Schema, optional

Concrete schema to use for scanning.

columns : list of str, default None

The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.

The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).

The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.

filter : Expression, default None

Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.

batch_size : int, default 131_072

The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this value can be lowered to reduce their size.

batch_readahead : int, default 16

The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_readahead : int, default 4

The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_scan_options : FragmentScanOptions, default None

Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.

use_threads : bool, default True

If enabled, maximum parallelism will be used, as determined by the number of available CPU cores.

cache_metadata : bool, default True

If enabled, metadata may be cached when scanning to speed up repeated scans.

memory_pool : MemoryPool, default None

For memory allocations, if required. If not specified, uses the default pool.

Returns:
table : Table
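
For illustration, a minimal sketch (the "year" column is an assumption; note the caveat above about materializing the whole result in memory):

>>> table = fragment.to_table(columns=["year"], filter=ds.field("year") == 2020)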