pyarrow.dataset.FileSystemDataset#
- class pyarrow.dataset.FileSystemDataset(fragments, Schema schema, FileFormat format, FileSystem filesystem=None, root_partition=None)#
Bases: Dataset
A Dataset of file fragments.
A FileSystemDataset is composed of one or more FileFragment.
- Parameters:
- fragments : list[Fragment]
  List of fragments to consume.
- schema : Schema
  The top-level schema of the Dataset.
- format : FileFormat
  File format of the fragments; currently only ParquetFileFormat, IpcFileFormat, CsvFileFormat, and JsonFileFormat are supported.
- filesystem : FileSystem
  FileSystem of the fragments.
- root_partition : Expression, optional
  The top-level partition of the Dataset.
- __init__(*args, **kwargs)#
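In practice a FileSystemDataset is usually obtained from the dataset() factory function rather than constructed directly. A minimal sketch, using a hypothetical file name ("example.parquet") that the later sketches on this page also assume:

>>> # Hypothetical setup: write a small Parquet file and open it as a dataset.
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pyarrow.dataset as ds
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100]})
>>> pq.write_table(table, "example.parquet")
>>> dataset = ds.dataset("example.parquet", format="parquet")
>>> isinstance(dataset, ds.FileSystemDataset)
True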
Methods
- __init__(*args, **kwargs)
- count_rows(self, Expression filter=None, ...) : Count rows matching the scanner filter.
- filter(self, expression) : Apply a row filter to the dataset.
- from_paths(cls, paths[, schema, format, ...]) : A Dataset created from a list of paths on a particular filesystem.
- get_fragments(self, Expression filter=None) : Returns an iterator over the fragments in this dataset.
- head(self, int num_rows[, columns]) : Load the first N rows of the dataset.
- join(self, right_dataset, keys[, ...]) : Perform a join between this dataset and another one.
- join_asof(self, right_dataset, on, by, tolerance) : Perform an asof join between this dataset and another one.
- replace_schema(self, Schema schema) : Return a copy of this Dataset with a different schema.
- scanner(self[, columns, filter]) : Build a scan operation against the dataset.
- sort_by(self, sorting, **kwargs) : Sort the Dataset by one or multiple columns.
- take(self, indices[, columns]) : Select rows of data by index.
- to_batches(self[, columns]) : Read the dataset as materialized record batches.
- to_table(self[, columns]) : Read the dataset to an Arrow table.
Attributes
- files : List of the files
- filesystem
- format : The FileFormat of this source.
- partition_expression : An Expression which evaluates to true for all data viewed by this Dataset.
- partitioning : The partitioning of the Dataset source, if discovered.
- schema : The common schema of the full Dataset
- count_rows(self, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)#
Count rows matching the scanner filter.
- Parameters:
- filter : Expression, default None
  Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size : int, default 131_072
  The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
- batch_readahead : int, default 16
  The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead : int, default 4
  The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options : FragmentScanOptions, default None
  Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threads : bool, default True
  If enabled, maximum parallelism will be used, as determined by the number of available CPU cores.
- cache_metadata : bool, default True
  If enabled, metadata may be cached when scanning to speed up repeated scans.
- memory_pool : MemoryPool, default None
  For memory allocations, if required. If not specified, uses the default pool.
- Returns:
- count : int
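For instance, a minimal sketch assuming the hypothetical "example.parquet" file from the class-level sketch above:

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("example.parquet")  # hypothetical file from the sketch above
>>> dataset.count_rows()                     # all rows in the dataset
6
>>> dataset.count_rows(filter=ds.field("year") > 2020)  # only rows passing the filter
4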
- files#
List of the files
- filesystem#
- filter(self, expression)#
Apply a row filter to the dataset.
- Parameters:
- expression : Expression
  The filter that should be applied to the dataset.
- Returns:
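A minimal sketch, again assuming the hypothetical "example.parquet" file from the class-level sketch above; the filter is applied to every subsequent scan of the returned dataset:

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("example.parquet")       # hypothetical file from the sketch above
>>> filtered = dataset.filter(ds.field("n_legs") >= 4)
>>> filtered.to_table().num_rows                  # only the matching rows are read
4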
- format#
The FileFormat of this source.
- classmethod from_paths(cls, paths, schema=None, format=None, filesystem=None, partitions=None, root_partition=None)#
A Dataset created from a list of paths on a particular filesystem.
- Parameters:
- paths : list of str
  List of file paths to create the fragments from.
- schema : Schema
  The top-level schema of the Dataset.
- format : FileFormat
  File format to create fragments from; currently only ParquetFileFormat, IpcFileFormat, CsvFileFormat, and JsonFileFormat are supported.
- filesystem : FileSystem
  The filesystem which files are from.
- partitions : list[Expression], optional
  Attach additional partition information for the file paths.
- root_partition : Expression, optional
  The top-level partition of the Dataset.
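A sketch with hypothetical paths ("data/part-0.parquet", "data/part-1.parquet"); it assumes those Parquet files exist locally and share the given schema:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> from pyarrow import fs
>>> schema = pa.schema([("year", pa.int64()), ("n_legs", pa.int64())])
>>> dataset = ds.FileSystemDataset.from_paths(
...     ["data/part-0.parquet", "data/part-1.parquet"],  # hypothetical paths
...     schema=schema,
...     format=ds.ParquetFileFormat(),
...     filesystem=fs.LocalFileSystem(),
... )
>>> dataset.files
['data/part-0.parquet', 'data/part-1.parquet']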
- get_fragments(self, Expression filter=None)#
Returns an iterator over the fragments in this dataset.
- Parameters:
- filter : Expression, default None
  Return fragments matching the optional filter, either using the partition_expression or internal information like Parquet's statistics.
- Returns:
- fragments : iterator of Fragment
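For example, a sketch listing the file behind each fragment, assuming the hypothetical "example.parquet" file from the class-level sketch above:

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("example.parquet")  # hypothetical file from the sketch above
>>> for fragment in dataset.get_fragments():
...     print(fragment.path)                 # each FileFragment knows its source file
example.parquet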
- head(self, int num_rows, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)#
Load the first N rows of the dataset.
- Parameters:
- num_rows : int
  The number of rows to load.
- columns : list of str, default None
  The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.
  The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
  The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset's Schema.
- filter : Expression, default None
  Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size : int, default 131_072
  The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
- batch_readahead : int, default 16
  The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead : int, default 4
  The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options : FragmentScanOptions, default None
  Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threads : bool, default True
  If enabled, maximum parallelism will be used, as determined by the number of available CPU cores.
- cache_metadata : bool, default True
  If enabled, metadata may be cached when scanning to speed up repeated scans.
- memory_pool : MemoryPool, default None
  For memory allocations, if required. If not specified, uses the default pool.
- Returns:
- table : Table
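A minimal sketch, assuming the hypothetical "example.parquet" file from the class-level sketch above:

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("example.parquet")      # hypothetical file from the sketch above
>>> dataset.head(2, columns=["year"]).to_pydict()
{'year': [2020, 2022]}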
- join(self, right_dataset, keys, right_keys=None, join_type='left outer', left_suffix=None, right_suffix=None, coalesce_keys=True, use_threads=True)#
Perform a join between this dataset and another one.
Result of the join will be a new dataset, where further operations can be applied.
- Parameters:
- right_dataset : Dataset
  The dataset to join to the current one, acting as the right dataset in the join operation.
- keys : str or list[str]
  The columns from current dataset that should be used as keys of the join operation left side.
- right_keys : str or list[str], default None
  The columns from the right_dataset that should be used as keys on the join operation right side. When None use the same key names as the left dataset.
- join_type : str, default "left outer"
  The kind of join that should be performed, one of ("left semi", "right semi", "left anti", "right anti", "inner", "left outer", "right outer", "full outer").
- left_suffix : str, default None
  Which suffix to add to left column names. This prevents confusion when the columns in left and right datasets have colliding names.
- right_suffix : str, default None
  Which suffix to add to the right column names. This prevents confusion when the columns in left and right datasets have colliding names.
- coalesce_keys : bool, default True
  If the duplicated keys should be omitted from one of the sides in the join result.
- use_threads : bool, default True
  Whether to use multithreading or not.
- Returns:
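A sketch joining the hypothetical "example.parquet" dataset from the class-level sketch above against a small in-memory dataset on the "year" column:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("example.parquet")      # hypothetical file from the sketch above
>>> labels = ds.InMemoryDataset(pa.table({'year': [2019, 2020, 2021, 2022],
...                                       'label': ['a', 'b', 'c', 'd']}))
>>> joined = dataset.join(labels, keys="year")   # left outer join by default
>>> sorted(joined.to_table().column_names)
['label', 'n_legs', 'year']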
- join_asof(self, right_dataset, on, by, tolerance, right_on=None, right_by=None)#
Perform an asof join between this dataset and another one.
This is similar to a left-join except that we match on nearest key rather than equal keys. Both datasets must be sorted by the key. This type of join is most useful for time series data that are not perfectly aligned.
Optionally match on equivalent keys with “by” before searching with “on”.
Result of the join will be a new Dataset, where further operations can be applied.
- Parameters:
- right_dataset : Dataset
  The dataset to join to the current one, acting as the right dataset in the join operation.
- on : str
  The column from current dataset that should be used as the "on" key of the join operation left side.
  An inexact match is used on the "on" key, i.e. a row is considered a match if and only if right.on - left.on is in the range [min(0, tolerance), max(0, tolerance)]. The input table must be sorted by the "on" key. Must be a single field of a common type.
  Currently, the "on" key must be an integer, date, or timestamp type.
- by : str or list[str]
  The columns from current dataset that should be used as the keys of the join operation left side. The join operation is then done only for the matches in these columns.
- tolerance : int
  The tolerance for inexact "on" key matching. A right row is considered a match with a left row if right.on - left.on is in the range [min(0, tolerance), max(0, tolerance)]. tolerance may be:
  - negative, in which case a past-as-of-join occurs (match iff tolerance <= right.on - left.on <= 0);
  - positive, in which case a future-as-of-join occurs (match iff 0 <= right.on - left.on <= tolerance);
  - zero, in which case an exact-as-of-join occurs (match iff right.on == left.on).
  The tolerance is interpreted in the same units as the "on" key.
- right_on : str or list[str], default None
  The columns from the right_dataset that should be used as the on key on the join operation right side. When None use the same key name as the left dataset.
- right_by : str or list[str], default None
  The columns from the right_dataset that should be used as by keys on the join operation right side. When None use the same key names as the left dataset.
- Returns:
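A sketch over two small in-memory datasets (the names left and right are hypothetical); with a positive tolerance each left row matches the nearest right row with 0 <= right.t - left.t <= tolerance within the same "key" group:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> left = ds.InMemoryDataset(pa.table({'t': [1, 5, 10], 'key': ['a', 'a', 'a']}))
>>> right = ds.InMemoryDataset(pa.table({'t': [2, 6, 11], 'key': ['a', 'a', 'a'],
...                                      'value': [1.0, 2.0, 3.0]}))
>>> # With tolerance=1, the left row at t=1 matches the right row at t=2, and so on.
>>> joined = left.join_asof(right, on="t", by="key", tolerance=1)
>>> result = joined.to_table()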
- partition_expression#
An Expression which evaluates to true for all data viewed by this Dataset.
- partitioning#
The partitioning of the Dataset source, if discovered.
If the FileSystemDataset is created using the dataset() factory function with a partitioning specified, this will return the finalized Partitioning object from the dataset discovery. In all other cases, this returns None.
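A sketch of the discovery case, using a hypothetical "partitioned_data" directory written with a hive-style "year" partitioning:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> table = pa.table({'year': [2020, 2021], 'n_legs': [2, 4]})
>>> part = ds.partitioning(pa.schema([("year", pa.int64())]), flavor="hive")
>>> ds.write_dataset(table, "partitioned_data", format="parquet", partitioning=part)
>>> dataset = ds.dataset("partitioned_data", format="parquet", partitioning="hive")
>>> isinstance(dataset.partitioning, ds.HivePartitioning)
True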
- replace_schema(self, Schema schema)#
Return a copy of this Dataset with a different schema.
The copy will view the same Fragments. If the new schema is not compatible with the original dataset's schema then an error will be raised.
- Parameters:
- schema : Schema
  The new dataset schema.
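A sketch restricting the dataset's view to a subset of its fields, assuming the hypothetical "example.parquet" file from the class-level sketch above (dropping a field is one schema change that stays compatible):

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("example.parquet")      # hypothetical file from the sketch above
>>> narrower = dataset.replace_schema(pa.schema([("year", pa.int64())]))
>>> narrower.to_table().column_names
['year']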
- scanner(self, columns=None, filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)#
Build a scan operation against the dataset.
Data is not loaded immediately. Instead, this produces a Scanner, which exposes further operations (e.g. loading all data as a table, counting rows).
See the Scanner.from_dataset() method for further information.
- Parameters:
- columns : list of str, default None
  The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.
  The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
  The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset's Schema.
- filter : Expression, default None
  Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size : int, default 131_072
  The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
- batch_readahead : int, default 16
  The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead : int, default 4
  The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options : FragmentScanOptions, default None
  Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threads : bool, default True
  If enabled, maximum parallelism will be used, as determined by the number of available CPU cores.
- cache_metadata : bool, default True
  If enabled, metadata may be cached when scanning to speed up repeated scans.
- memory_pool : MemoryPool, default None
  For memory allocations, if required. If not specified, uses the default pool.
- Returns:
- scanner : Scanner
Examples
>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>>
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, "dataset_scanner.parquet")
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
Selecting a subset of the columns:
>>> dataset.scanner(columns=["year", "n_legs"]).to_table()
pyarrow.Table
year: int64
n_legs: int64
----
year: [[2020,2022,2021,2022,2019,2021]]
n_legs: [[2,2,4,4,5,100]]
Projecting selected columns using an expression:
>>> dataset.scanner(columns={
...     "n_legs_uint": ds.field("n_legs").cast("uint8"),
... }).to_table()
pyarrow.Table
n_legs_uint: uint8
----
n_legs_uint: [[2,2,4,4,5,100]]
Filtering rows while scanning:
>>> dataset.scanner(filter=ds.field("year") > 2020).to_table()
pyarrow.Table
year: int64
n_legs: int64
animal: string
----
year: [[2022,2021,2022,2021]]
n_legs: [[2,4,4,100]]
animal: [["Parrot","Dog","Horse","Centipede"]]
- schema#
The common schema of the full Dataset
- sort_by(self, sorting, **kwargs)#
Sort the Dataset by one or multiple columns.
- Parameters:
- Returns:
- InMemoryDataset
  A new dataset sorted according to the sort keys.
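A sketch, assuming the hypothetical "example.parquet" file from the class-level sketch above; sorting keys are given as (name, order) pairs:

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("example.parquet")      # hypothetical file from the sketch above
>>> sorted_ds = dataset.sort_by([("year", "ascending")])
>>> sorted_ds.to_table()["year"].to_pylist()
[2019, 2020, 2021, 2021, 2022, 2022]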
- take(self, indices, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)#
Select rows of data by index.
- Parameters:
- indices : Array or array-like
  Indices of rows to select in the dataset.
- columns : list of str, default None
  The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.
  The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
  The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset's Schema.
- filter : Expression, default None
  Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size : int, default 131_072
  The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
- batch_readahead : int, default 16
  The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead : int, default 4
  The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options : FragmentScanOptions, default None
  Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threads : bool, default True
  If enabled, maximum parallelism will be used, as determined by the number of available CPU cores.
- cache_metadata : bool, default True
  If enabled, metadata may be cached when scanning to speed up repeated scans.
- memory_pool : MemoryPool, default None
  For memory allocations, if required. If not specified, uses the default pool.
- Returns:
- table : Table
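A minimal sketch, assuming the hypothetical "example.parquet" file from the class-level sketch above:

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("example.parquet")      # hypothetical file from the sketch above
>>> dataset.take([0, 5], columns=["year", "n_legs"]).to_pydict()
{'year': [2020, 2021], 'n_legs': [2, 100]}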
- to_batches(self, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)#
Read the dataset as materialized record batches.
- Parameters:
- columns : list of str, default None
  The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.
  The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
  The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset's Schema.
- filter : Expression, default None
  Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size : int, default 131_072
  The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
- batch_readahead : int, default 16
  The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead : int, default 4
  The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options : FragmentScanOptions, default None
  Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threads : bool, default True
  If enabled, maximum parallelism will be used, as determined by the number of available CPU cores.
- cache_metadata : bool, default True
  If enabled, metadata may be cached when scanning to speed up repeated scans.
- memory_pool : MemoryPool, default None
  For memory allocations, if required. If not specified, uses the default pool.
- Returns:
- record_batches : iterator of RecordBatch
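A sketch iterating over the stream of record batches, assuming the hypothetical "example.parquet" file from the class-level sketch above:

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("example.parquet")      # hypothetical file from the sketch above
>>> total = 0
>>> for batch in dataset.to_batches(columns=["n_legs"]):
...     total += batch.num_rows                  # each item is a pyarrow.RecordBatch
>>> total
6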
- to_table(self, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)#
Read the dataset to an Arrow table.
Note that this method reads all the selected data from the dataset into memory.
- Parameters:
- columns : list of str, default None
  The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.
  The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
  The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset's Schema.
- filter : Expression, default None
  Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size : int, default 131_072
  The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
- batch_readahead : int, default 16
  The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead : int, default 4
  The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options : FragmentScanOptions, default None
  Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threads : bool, default True
  If enabled, maximum parallelism will be used, as determined by the number of available CPU cores.
- cache_metadata : bool, default True
  If enabled, metadata may be cached when scanning to speed up repeated scans.
- memory_pool : MemoryPool, default None
  For memory allocations, if required. If not specified, uses the default pool.
- Returns:
- table : Table
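A minimal sketch, assuming the hypothetical "example.parquet" file from the class-level sketch above:

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("example.parquet")      # hypothetical file from the sketch above
>>> dataset.to_table(columns=["year"], filter=ds.field("n_legs") == 2).to_pydict()
{'year': [2020, 2022]}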

