pyarrow.dataset.Fragment
- class pyarrow.dataset.Fragment
Bases: _Weakrefable
Fragment of data from a Dataset.
- __init__(*args, **kwargs)
Methods
__init__(*args, **kwargs)
count_rows(self, Expression filter=None, ...): Count rows matching the scanner filter.
head(self, int num_rows[, columns]): Load the first N rows of the fragment.
scanner(self, Schema schema=None[, columns]): Build a scan operation against the fragment.
take(self, indices[, columns]): Select rows of data by index.
to_batches(self, Schema schema=None[, columns]): Read the fragment as materialized record batches.
to_table(self, Schema schema=None[, columns]): Convert this Fragment into a Table.
Attributes
partition_expression: An Expression which evaluates to true for all data viewed by this Fragment.
physical_schema: Return the physical schema of this Fragment.
- count_rows(self, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)
Count rows matching the scanner filter.
- Parameters:
- filter
Expression, default None Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size
int, default 131_072 The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this value can be lowered to reduce their size.
- batch_readahead
int, default 16 The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead
int, default 4 The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options
FragmentScanOptions, default None Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threads
bool, default True If enabled, then maximum parallelism will be used, determined by the number of available CPU cores.
- cache_metadata
bool, default True If enabled, metadata may be cached when scanning to speed up repeated scans.
- memory_pool
MemoryPool, default None For memory allocations, if required. If not specified, uses the default pool.
- Returns:
- count
int
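- Examples:
A minimal sketch, assuming a hypothetical Parquet dataset under data/ that is hive-partitioned on a year column (the path and column name are illustrative, not part of this API):
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("data/", format="parquet", partitioning="hive")  # hypothetical layout
>>> fragment = next(iter(dataset.get_fragments()))
>>> fragment.count_rows(filter=ds.field("year") == 2020)  # returns an int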
- head(self, int num_rows, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)
Load the first N rows of the fragment.
- Parameters:
- num_rows
int The number of rows to load.
- columns
list of str, default None The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.
The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.
- filter
Expression, default None Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size
int, default 131_072 The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this value can be lowered to reduce their size.
- batch_readahead
int, default 16 The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead
int, default 4 The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options
FragmentScanOptions, default None Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threads
bool, default True If enabled, then maximum parallelism will be used, determined by the number of available CPU cores.
- cache_metadata
bool, default True If enabled, metadata may be cached when scanning to speed up repeated scans.
- memory_pool
MemoryPool, default None For memory allocations, if required. If not specified, uses the default pool.
- Returns:
- table
Table
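- Examples:
A brief sketch, reusing the hypothetical fragment from the count_rows example above:
>>> fragment.head(5, columns=["year"])  # first five rows as a pyarrow.Table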
- partition_expression
An Expression which evaluates to true for all data viewed by this Fragment.
- physical_schema
Return the physical schema of this Fragment. This schema can be different from the dataset read schema.
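For illustration, with the hypothetical partitioned dataset from the count_rows example:
>>> fragment.physical_schema  # per-file schema; may differ from the dataset read schema
>>> fragment.partition_expression  # e.g. the hive partition, such as ds.field("year") == 2020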
- scanner(self, Schema schema=None, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)
Build a scan operation against the fragment.
Data is not loaded immediately. Instead, this produces a Scanner, which exposes further operations (e.g. loading all data as a table, counting rows).
- Parameters:
- schema
Schema Schema to use for scanning. This is used to unify a Fragment to its Dataset’s schema. If not specified this will use the Fragment’s physical schema which might differ for each Fragment.
- columns
list of str, default None The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.
The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.
- filter
Expression, default None Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size
int, default 131_072 The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this value can be lowered to reduce their size.
- batch_readahead
int, default 16 The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead
int, default 4 The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options
FragmentScanOptions, default None Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threads
bool, default True If enabled, then maximum parallelism will be used, determined by the number of available CPU cores.
- cache_metadata
bool, default True If enabled, metadata may be cached when scanning to speed up repeated scans.
- memory_pool
MemoryPool, default None For memory allocations, if required. If not specified, uses the default pool.
- Returns:
- scanner
Scanner
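- Examples:
A sketch, continuing with the hypothetical fragment from the count_rows example:
>>> scanner = fragment.scanner(columns=["year"], batch_size=65_536)
>>> scanner.count_rows()  # no data is loaded until an operation like this runs
>>> table = scanner.to_table()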
- take(self, indices, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)
Select rows of data by index.
- Parameters:
- indices
Array or array-like The indices of rows to select in the dataset.
- columns
list of str, default None The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.
The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.
- filter
Expression, default None Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size
int, default 131_072 The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this value can be lowered to reduce their size.
- batch_readahead
int, default 16 The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead
int, default 4 The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options
FragmentScanOptions, default None Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threads
bool, default True If enabled, then maximum parallelism will be used, determined by the number of available CPU cores.
- cache_metadata
bool, default True If enabled, metadata may be cached when scanning to speed up repeated scans.
- memory_pool
MemoryPool, default None For memory allocations, if required. If not specified, uses the default pool.
- Returns:
- table
Table
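- Examples:
A sketch, reusing the hypothetical fragment from the count_rows example:
>>> fragment.take([0, 1, 4])  # rows 0, 1 and 4 as a pyarrow.Table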
- to_batches(self, Schema schema=None, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)
Read the fragment as materialized record batches.
- Parameters:
- schema
Schema, optional Concrete schema to use for scanning.
- columns
list of str, default None The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.
The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.
- filter
Expression, default None Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size
int, default 131_072 The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this value can be lowered to reduce their size.
- batch_readahead
int, default 16 The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead
int, default 4 The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options
FragmentScanOptions, default None Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threads
bool, default True If enabled, then maximum parallelism will be used, determined by the number of available CPU cores.
- cache_metadata
bool, default True If enabled, metadata may be cached when scanning to speed up repeated scans.
- memory_pool
MemoryPool, default None For memory allocations, if required. If not specified, uses the default pool.
- Returns:
- record_batches
iterator of RecordBatch
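- Examples:
A sketch, streaming the hypothetical fragment from the count_rows example batch by batch:
>>> for batch in fragment.to_batches(batch_size=32_768):
...     print(batch.num_rows)  # each batch is a pyarrow.RecordBatch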
- to_table(self, Schema schema=None, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)
Convert this Fragment into a Table.
Use this convenience utility with care. This will serially materialize the Scan result in memory before creating the Table.
- Parameters:
- schema
Schema, optional Concrete schema to use for scanning.
- columns
list of str, default None The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.
The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.
- filter
Expression, default None Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size
int, default 131_072 The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this value can be lowered to reduce their size.
- batch_readahead
int, default 16 The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead
int, default 4 The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options
FragmentScanOptions, default None Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threads
bool, default True If enabled, then maximum parallelism will be used, determined by the number of available CPU cores.
- cache_metadata
bool, default True If enabled, metadata may be cached when scanning to speed up repeated scans.
- memory_pool
MemoryPool, default None For memory allocations, if required. If not specified, uses the default pool.
- Returns:
- table
Table
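- Examples:
A sketch, materializing the hypothetical fragment from the count_rows example with a filter applied:
>>> table = fragment.to_table(filter=ds.field("year") == 2020)
>>> table.num_rows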

