pyarrow.dataset.Scanner#

class pyarrow.dataset.Scanner#

Bases: _Weakrefable

A materialized scan operation with context and options bound.

A scanner is the class that glues the scan tasks, data fragments and data sources together.

__init__(*args, **kwargs)#

Methods

__init__(*args, **kwargs)

count_rows(self)

Count rows matching the scanner filter.

from_batches(source, *, Schema schema=None)

Create a Scanner from an iterator of batches.

from_dataset(Dataset dataset, *[, columns, ...])

Create a Scanner from a Dataset.

from_fragment(Fragment fragment, *, ...[, ...])

Create a Scanner from a Fragment.

head(self, int num_rows)

Load the first N rows of the dataset.

scan_batches(self)

Consume a Scanner in record batches with corresponding fragments.

take(self, indices)

Select rows of data by index.

to_batches(self)

Consume a Scanner in record batches.

to_reader(self)

Consume this scanner as a RecordBatchReader.

to_table(self)

Convert a Scanner into a Table.

Attributes

dataset_schema

The schema with which batches will be read from fragments.

projected_schema

The materialized schema of the data, accounting for projections.

count_rows(self)#

Count rows matching the scanner filter.

Returns:
count : int
dataset_schema#

The schema with which batches will be read from fragments.

static from_batches(source, *, Schema schema=None, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)#

Create a Scanner from an iterator of batches.

This creates a scanner which can be used only once. It is intended to support writing a dataset (which takes a scanner) from a source which can be read only once (e.g. a RecordBatchReader or generator).

Parameters:
source : Iterator or Arrow-compatible stream object

The iterator of Batches. This can be a pyarrow RecordBatchReader, any object that implements the Arrow PyCapsule Protocol for streams, or an actual Python iterator of RecordBatches.

schema : Schema

The schema of the batches (required when passing a Python iterator).

columns : list[str] or dict[str, Expression], default None

The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.

The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).

The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset's Schema.

filter : Expression, default None

Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.

batch_size : int, default 131_072

The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.

batch_readahead : int, default 16

The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_readahead : int, default 4

The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_scan_options : FragmentScanOptions, default None

Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.

use_threads : bool, default True

If enabled, then maximum parallelism will be used determined by the number of available CPU cores.

cache_metadata : bool, default True

If enabled, metadata may be cached when scanning to speed up repeated scans.

memory_pool : MemoryPool, default None

For memory allocations, if required. If not specified, uses the default pool.

static from_dataset(Dataset dataset, *, columns=None, filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)#

Create a Scanner from a Dataset.

Parameters:
dataset : Dataset

Dataset to scan.

columns : list[str] or dict[str, Expression], default None

The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.

The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).

The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset's Schema.

filter : Expression, default None

Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.

batch_size : int, default 131_072

The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.

batch_readahead : int, default 16

The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_readahead : int, default 4

The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_scan_options : FragmentScanOptions, default None

Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.

use_threads : bool, default True

If enabled, then maximum parallelism will be used determined by the number of available CPU cores.

cache_metadata : bool, default True

If enabled, metadata may be cached when scanning to speed up repeated scans.

memory_pool : MemoryPool, default None

For memory allocations, if required. If not specified, uses the default pool.

static from_fragment(Fragment fragment, *, Schema schema=None, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)#

Create a Scanner from a Fragment.

Parameters:
fragment : Fragment

The fragment to scan.

schema : Schema, optional

The schema of the fragment.

columns : list[str] or dict[str, Expression], default None

The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.

The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).

The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset's Schema.

filter : Expression, default None

Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.

batch_size : int, default 131_072

The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.

batch_readahead : int, default 16

The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_readahead : int, default 4

The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_scan_options : FragmentScanOptions, default None

Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.

use_threads : bool, default True

If enabled, then maximum parallelism will be used determined by the number of available CPU cores.

cache_metadata : bool, default True

If enabled, metadata may be cached when scanning to speed up repeated scans.

memory_pool : MemoryPool, default None

For memory allocations, if required. If not specified, uses the default pool.

head(self, int num_rows)#

Load the first N rows of the dataset.

Parameters:
num_rows : int

The number of rows to load.

Returns:
Table
projected_schema#

The materialized schema of the data, accounting for projections.

This is the schema of any data returned from the scanner.

scan_batches(self)#

Consume a Scanner in record batches with corresponding fragments.

Returns:
record_batches : iterator of TaggedRecordBatch
take(self, indices)#

Select rows of data by index.

Will only consume as many batches of the underlying dataset as needed. Otherwise, this is equivalent to to_table().take(indices).

Parameters:
indices : Array or array-like

Indices of rows to select in the dataset.

Returns:
Table
to_batches(self)#

Consume a Scanner in record batches.

Returns:
record_batches : iterator of RecordBatch
to_reader(self)#

Consume this scanner as a RecordBatchReader.

Returns:
RecordBatchReader
to_table(self)#

Convert a Scanner into a Table.

Use this convenience utility with care. This will serially materialize the Scan result in memory before creating the Table.

Returns:
Table