pyarrow.dataset.FileSystemDataset#

class pyarrow.dataset.FileSystemDataset(fragments, Schema schema, FileFormat format, FileSystem filesystem=None, root_partition=None)#

Bases: Dataset

A Dataset of file fragments.

A FileSystemDataset is composed of one or more FileFragment.

Parameters:
fragments : list[Fragment]

List of fragments to consume.

schema : Schema

The top-level schema of the Dataset.

format : FileFormat

File format of the fragments, currently only ParquetFileFormat, IpcFileFormat, CsvFileFormat, and JsonFileFormat are supported.

filesystem : FileSystem

FileSystem of the fragments.

root_partition : Expression, optional

The top-level partition of the Dataset.
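
As a usage sketch (not part of the signature documentation above): a FileSystemDataset is usually obtained from the pyarrow.dataset.dataset() factory function rather than constructed directly. The file name below is illustrative.

>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pyarrow.dataset as ds
>>> # Write a small example file, then discover it as a dataset.
>>> pq.write_table(pa.table({"year": [2020, 2021, 2022],
...                          "n_legs": [2, 4, 100]}),
...                "example_dataset.parquet")
>>> dataset = ds.dataset("example_dataset.parquet", format="parquet")
>>> isinstance(dataset, ds.FileSystemDataset)
True
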

__init__(*args, **kwargs)#

Methods

__init__(*args, **kwargs)

count_rows(self, Expression filter=None, ...)

Count rows matching the scanner filter.

filter(self, expression)

Apply a row filter to the dataset.

from_paths(cls, paths[, schema, format, ...])

A Dataset created from a list of paths on a particular filesystem.

get_fragments(self, Expression filter=None)

Returns an iterator over the fragments in this dataset.

head(self, int num_rows[, columns])

Load the first N rows of the dataset.

join(self, right_dataset, keys[, ...])

Perform a join between this dataset and another one.

join_asof(self, right_dataset, on, by, tolerance)

Perform an asof join between this dataset and another one.

replace_schema(self, Schema schema)

Return a copy of this Dataset with a different schema.

scanner(self[, columns, filter])

Build a scan operation against the dataset.

sort_by(self, sorting, **kwargs)

Sort the Dataset by one or multiple columns.

take(self, indices[, columns])

Select rows of data by index.

to_batches(self[, columns])

Read the dataset as materialized record batches.

to_table(self[, columns])

Read the dataset to an Arrow table.

Attributes

files

List of the files

filesystem

format

The FileFormat of this source.

partition_expression

An Expression which evaluates to true for all data viewed by this Dataset.

partitioning

The partitioning of the Dataset source, if discovered.

schema

The common schema of the full Dataset

count_rows(self, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)#

Count rows matching the scanner filter.

Parameters:
filter : Expression, default None

Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.

batch_size : int, default 131_072

The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.

batch_readahead : int, default 16

The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_readahead : int, default 4

The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_scan_options : FragmentScanOptions, default None

Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.

use_threads : bool, default True

If enabled, then maximum parallelism will be used determined by the number of available CPU cores.

cache_metadata : bool, default True

If enabled, metadata may be cached when scanning to speed up repeated scans.

memory_pool : MemoryPool, default None

For memory allocations, if required. If not specified, uses the default pool.

Returns:
count : int
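
For illustration (a minimal sketch reusing the `dataset` object and `year` column from the construction example near the top of this page):

>>> import pyarrow.dataset as ds
>>> n_total = dataset.count_rows()
>>> n_recent = dataset.count_rows(filter=ds.field("year") > 2020)
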
files#

List of the files

filesystem#
filter(self, expression)#

Apply a row filter to the dataset.

Parameters:
expression : Expression

The filter that should be applied to the dataset.

Returns:
Dataset
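
A minimal sketch, assuming the `dataset` object and `year` column from the construction example above:

>>> import pyarrow.dataset as ds
>>> recent = dataset.filter(ds.field("year") > 2020)
>>> table = recent.to_table()  # only rows with year > 2020 are read
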
format#

The FileFormat of this source.

classmethod from_paths(cls, paths, schema=None, format=None, filesystem=None, partitions=None, root_partition=None)#

A Dataset created from a list of paths on a particular filesystem.

Parameters:
paths : list of str

List of file paths to create the fragments from.

schema : Schema

The top-level schema of the Dataset.

format : FileFormat

File format to create fragments from, currently only ParquetFileFormat, IpcFileFormat, CsvFileFormat, and JsonFileFormat are supported.

filesystem : FileSystem

The filesystem which files are from.

partitions : list[Expression], optional

Attach additional partition information for the file paths.

root_partition : Expression, optional

The top-level partition of the Dataset.
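
A sketch of direct construction, reusing the parquet file and `dataset` object from the construction example above; the explicit format and filesystem objects are illustrative choices:

>>> import pyarrow.dataset as ds
>>> from pyarrow import fs
>>> dataset2 = ds.FileSystemDataset.from_paths(
...     ["example_dataset.parquet"],
...     schema=dataset.schema,
...     format=ds.ParquetFileFormat(),
...     filesystem=fs.LocalFileSystem(),
... )
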

get_fragments(self, Expression filter=None)#

Returns an iterator over the fragments in this dataset.

Parameters:
filter : Expression, default None

Return fragments matching the optional filter, either using the partition_expression or internal information like Parquet’s statistics.

Returns:
fragments : iterator of Fragment
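
For example (a sketch assuming the `dataset` object from the construction example above; each FileFragment exposes the path of its backing file):

>>> for fragment in dataset.get_fragments():
...     path = fragment.path  # path of the backing file
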
head(self, int num_rows, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)#

Load the first N rows of the dataset.

Parameters:
num_rows : int

The number of rows to load.

columns : list of str, default None

The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.

The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).

The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.

filter : Expression, default None

Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.

batch_size : int, default 131_072

The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.

batch_readahead : int, default 16

The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_readahead : int, default 4

The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_scan_options : FragmentScanOptions, default None

Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.

use_threads : bool, default True

If enabled, then maximum parallelism will be used determined by the number of available CPU cores.

cache_metadata : bool, default True

If enabled, metadata may be cached when scanning to speed up repeated scans.

memory_pool : MemoryPool, default None

For memory allocations, if required. If not specified, uses the default pool.

Returns:
table : Table
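
A minimal sketch, assuming the `dataset` object and column names from the construction example above:

>>> first_rows = dataset.head(2, columns=["year", "n_legs"])
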
join(self, right_dataset, keys, right_keys=None, join_type='left outer', left_suffix=None, right_suffix=None, coalesce_keys=True, use_threads=True)#

Perform a join between this dataset and another one.

Result of the join will be a new dataset, where furtheroperations can be applied.

Parameters:
right_dataset : dataset

The dataset to join to the current one, acting as the right dataset in the join operation.

keys : str or list[str]

The columns from current dataset that should be used as keys of the join operation left side.

right_keys : str or list[str], default None

The columns from the right_dataset that should be used as keys on the join operation right side. When None use the same key names as the left dataset.

join_type : str, default “left outer”

The kind of join that should be performed, one of (“left semi”, “right semi”, “left anti”, “right anti”, “inner”, “left outer”, “right outer”, “full outer”).

left_suffix : str, default None

Which suffix to add to left column names. This prevents confusion when the columns in left and right datasets have colliding names.

right_suffix : str, default None

Which suffix to add to the right column names. This prevents confusion when the columns in left and right datasets have colliding names.

coalesce_keys : bool, default True

If the duplicated keys should be omitted from one of the sides in the join result.

use_threads : bool, default True

Whether to use multithreading or not.

Returns:
InMemoryDataset
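
A minimal sketch, assuming the `dataset` object from the construction example above; the right-hand dataset and its `label` column are made up for illustration:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> labels = ds.dataset(pa.table({"year": [2020, 2021],
...                               "label": ["before", "after"]}))
>>> joined = dataset.join(labels, keys="year", join_type="left outer")
>>> joined_table = joined.to_table()
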
join_asof(self, right_dataset, on, by, tolerance, right_on=None, right_by=None)#

Perform an asof join between this dataset and another one.

This is similar to a left-join except that we match on nearest key ratherthan equal keys. Both datasets must be sorted by the key. This type of joinis most useful for time series data that are not perfectly aligned.

Optionally match on equivalent keys with “by” before searching with “on”.

Result of the join will be a new Dataset, where furtheroperations can be applied.

Parameters:
right_dataset : dataset

The dataset to join to the current one, acting as the right dataset in the join operation.

on : str

The column from current dataset that should be used as the “on” key of the join operation left side.

An inexact match is used on the “on” key, i.e. a row is considered a match if and only if right.on - left.on is in the range [min(0, tolerance), max(0, tolerance)].

The input table must be sorted by the “on” key. Must be a single field of a common type.

Currently, the “on” key must be an integer, date, or timestamp type.

by : str or list[str]

The columns from current dataset that should be used as the keys of the join operation left side. The join operation is then done only for the matches in these columns.

tolerance : int

The tolerance for inexact “on” key matching. A right row is considered a match with a left row if right.on - left.on is in the range [min(0, tolerance), max(0, tolerance)]. tolerance may be:

  • negative, in which case a past-as-of-join occurs (match iff tolerance <= right.on - left.on <= 0);

  • or positive, in which case a future-as-of-join occurs (match iff 0 <= right.on - left.on <= tolerance);

  • or zero, in which case an exact-as-of-join occurs (match iff right.on == left.on).

The tolerance is interpreted in the same units as the “on” key.

right_on : str or list[str], default None

The columns from the right_dataset that should be used as the on key on the join operation right side. When None use the same key name as the left dataset.

right_by : str or list[str], default None

The columns from the right_dataset that should be used as by keys on the join operation right side. When None use the same key names as the left dataset.

Returns:
InMemoryDataset
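
A minimal sketch with made-up time-series data; both inputs are sorted by the “on” key `t` and matched within a tolerance of 2 units:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> left = ds.dataset(pa.table({"t": [1, 5, 10], "key": [1, 1, 1]}))
>>> right = ds.dataset(pa.table({"t": [2, 6, 11], "key": [1, 1, 1],
...                              "value": [1.0, 2.0, 3.0]}))
>>> joined = left.join_asof(right, on="t", by="key", tolerance=2)
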
partition_expression#

An Expression which evaluates to true for all data viewed by this Dataset.

partitioning#

The partitioning of the Dataset source, if discovered.

If the FileSystemDataset is created using the dataset() factory function with a partitioning specified, this will return the finalized Partitioning object from the dataset discovery. In all other cases, this returns None.
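
For instance (a sketch with an illustrative directory name, writing a hive-partitioned dataset first and then rediscovering it):

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> ds.write_dataset(pa.table({"year": [2020, 2021], "n_legs": [2, 4]}),
...                  "partitioned_example", format="parquet",
...                  partitioning=["year"], partitioning_flavor="hive")
>>> partitioned = ds.dataset("partitioned_example", format="parquet",
...                          partitioning="hive")
>>> part = partitioned.partitioning  # finalized Partitioning, or None
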

replace_schema(self, Schema schema)#

Return a copy of this Dataset with a different schema.

The copy will view the same Fragments. If the new schema is not compatible with the original dataset’s schema then an error will be raised.

Parameters:
schema : Schema

The new dataset schema.
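
A minimal sketch, assuming the `dataset` object from the construction example above and keeping only a compatible subset of its fields:

>>> import pyarrow as pa
>>> subset_schema = pa.schema([dataset.schema.field("year")])
>>> narrowed = dataset.replace_schema(subset_schema)
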

scanner(self, columns=None, filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)#

Build a scan operation against the dataset.

Data is not loaded immediately. Instead, this produces a Scanner, which exposes further operations (e.g. loading all data as a table, counting rows).

See the Scanner.from_dataset() method for further information.

Parameters:
columns : list of str, default None

The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.

The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).

The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.

filter : Expression, default None

Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.

batch_size : int, default 131_072

The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.

batch_readahead : int, default 16

The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_readahead : int, default 4

The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_scan_options : FragmentScanOptions, default None

Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.

use_threads : bool, default True

If enabled, then maximum parallelism will be used determined by the number of available CPU cores.

cache_metadata : bool, default True

If enabled, metadata may be cached when scanning to speed up repeated scans.

memory_pool : MemoryPool, default None

For memory allocations, if required. If not specified, uses the default pool.

Returns:
scanner : Scanner

Examples

>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>>
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, "dataset_scanner.parquet")
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")

Selecting a subset of the columns:

>>> dataset.scanner(columns=["year", "n_legs"]).to_table()
pyarrow.Table
year: int64
n_legs: int64
----
year: [[2020,2022,2021,2022,2019,2021]]
n_legs: [[2,2,4,4,5,100]]

Projecting selected columns using an expression:

>>> dataset.scanner(columns={
...     "n_legs_uint": ds.field("n_legs").cast("uint8"),
... }).to_table()
pyarrow.Table
n_legs_uint: uint8
----
n_legs_uint: [[2,2,4,4,5,100]]

Filtering rows while scanning:

>>> dataset.scanner(filter=ds.field("year") > 2020).to_table()
pyarrow.Table
year: int64
n_legs: int64
animal: string
----
year: [[2022,2021,2022,2021]]
n_legs: [[2,4,4,100]]
animal: [["Parrot","Dog","Horse","Centipede"]]

schema#

The common schema of the full Dataset

sort_by(self, sorting, **kwargs)#

Sort the Dataset by one or multiple columns.

Parameters:
sorting : str or list[tuple(name, order)]

Name of the column to use to sort (ascending), or a list of multiple sorting conditions where each entry is a tuple with column name and sorting order (“ascending” or “descending”).

**kwargs : dict, optional

Additional sorting options. As allowed by SortOptions.

Returns:
InMemoryDataset

A new dataset sorted according to the sort keys.
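
A minimal sketch, assuming the `dataset` object and `year` column from the construction example above:

>>> sorted_dataset = dataset.sort_by([("year", "descending")])
>>> sorted_table = sorted_dataset.to_table()
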

take(self, indices, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)#

Select rows of data by index.

Parameters:
indices : Array or array-like

indices of rows to select in the dataset.

columns : list of str, default None

The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.

The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).

The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.

filter : Expression, default None

Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.

batch_size : int, default 131_072

The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.

batch_readahead : int, default 16

The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_readahead : int, default 4

The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_scan_options : FragmentScanOptions, default None

Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.

use_threads : bool, default True

If enabled, then maximum parallelism will be used determined by the number of available CPU cores.

cache_metadata : bool, default True

If enabled, metadata may be cached when scanning to speed up repeated scans.

memory_pool : MemoryPool, default None

For memory allocations, if required. If not specified, uses the default pool.

Returns:
table : Table
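
A minimal sketch, assuming the `dataset` object from the construction example above:

>>> subset = dataset.take([0, 2], columns=["year"])
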
to_batches(self, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)#

Read the dataset as materialized record batches.

Parameters:
columns : list of str, default None

The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.

The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).

The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.

filter : Expression, default None

Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.

batch_size : int, default 131_072

The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.

batch_readahead : int, default 16

The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_readahead : int, default 4

The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_scan_options : FragmentScanOptions, default None

Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.

use_threads : bool, default True

If enabled, then maximum parallelism will be used determined by the number of available CPU cores.

cache_metadata : bool, default True

If enabled, metadata may be cached when scanning to speed up repeated scans.

memory_pool : MemoryPool, default None

For memory allocations, if required. If not specified, uses the default pool.

Returns:
record_batches : iterator of RecordBatch
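
A minimal sketch, assuming the `dataset` object from the construction example above:

>>> for batch in dataset.to_batches(columns=["year"], batch_size=1024):
...     n_rows = batch.num_rows  # process one RecordBatch at a time
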
to_table(self, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)#

Read the dataset to an Arrow table.

Note that this method reads all the selected data from the dataset into memory.

Parameters:
columns : list of str, default None

The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.

The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).

The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.

filter : Expression, default None

Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.

batch_size : int, default 131_072

The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.

batch_readahead : int, default 16

The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_readahead : int, default 4

The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_scan_options : FragmentScanOptions, default None

Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.

use_threads : bool, default True

If enabled, then maximum parallelism will be used determined by the number of available CPU cores.

cache_metadata : bool, default True

If enabled, metadata may be cached when scanning to speed up repeated scans.

memory_pool : MemoryPool, default None

For memory allocations, if required. If not specified, uses the default pool.

Returns:
table : Table
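
A minimal sketch, assuming the `dataset` object and columns from the construction example above:

>>> import pyarrow.dataset as ds
>>> table = dataset.to_table(columns=["year", "n_legs"],
...                          filter=ds.field("year") > 2020)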