pyarrow.dataset.FileSystemDataset#
- class pyarrow.dataset.FileSystemDataset(fragments, Schema schema, FileFormat format, FileSystem filesystem=None, root_partition=None)#
Bases: Dataset
A Dataset of file fragments.
A FileSystemDataset is composed of one or more FileFragment.
- Parameters:
- fragments : list[Fragment]
  List of fragments to consume.
- schema : Schema
  The top-level schema of the Dataset.
- format : FileFormat
  File format of the fragments; currently only ParquetFileFormat, IpcFileFormat, CsvFileFormat, and JsonFileFormat are supported.
- filesystem : FileSystem
  FileSystem of the fragments.
- root_partition : Expression, optional
  The top-level partition of the Dataset.
- __init__(*args, **kwargs)#
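In practice a FileSystemDataset is usually obtained from the dataset() factory function rather than constructed directly. A minimal sketch, using a hypothetical file name ("example.parquet") that the later sketches on this page also assume:

>>> # Hypothetical setup: write a small Parquet file and open it as a dataset.
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pyarrow.dataset as ds
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100]})
>>> pq.write_table(table, "example.parquet")
>>> dataset = ds.dataset("example.parquet", format="parquet")
>>> isinstance(dataset, ds.FileSystemDataset)
True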
Methods
- __init__(*args, **kwargs)
- count_rows(self, Expression filter=None, ...) : Count rows matching the scanner filter.
- filter(self, expression) : Apply a row filter to the dataset.
- from_paths(cls, paths[, schema, format, ...]) : A Dataset created from a list of paths on a particular filesystem.
- get_fragments(self, Expression filter=None) : Returns an iterator over the fragments in this dataset.
- head(self, int num_rows[, columns]) : Load the first N rows of the dataset.
- join(self, right_dataset, keys[, ...]) : Perform a join between this dataset and another one.
- join_asof(self, right_dataset, on, by, tolerance) : Perform an asof join between this dataset and another one.
- replace_schema(self, Schema schema) : Return a copy of this Dataset with a different schema.
- scanner(self[, columns, filter]) : Build a scan operation against the dataset.
- sort_by(self, sorting, **kwargs) : Sort the Dataset by one or multiple columns.
- take(self, indices[, columns]) : Select rows of data by index.
- to_batches(self[, columns]) : Read the dataset as materialized record batches.
- to_table(self[, columns]) : Read the dataset to an Arrow table.
Attributes
- files : List of the files
- filesystem
- format : The FileFormat of this source.
- partition_expression : An Expression which evaluates to true for all data viewed by this Dataset.
- partitioning : The partitioning of the Dataset source, if discovered.
- schema : The common schema of the full Dataset
- count_rows(self, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)#
Count rows matching the scanner filter.
- Parameters:
- filter : Expression, default None
  Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size : int, default 131_072
  The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
- batch_readahead : int, default 16
  The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead : int, default 4
  The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options : FragmentScanOptions, default None
  Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threads : bool, default True
  If enabled, maximum parallelism will be used, as determined by the number of available CPU cores.
- cache_metadata : bool, default True
  If enabled, metadata may be cached when scanning to speed up repeated scans.
- memory_pool : MemoryPool, default None
  For memory allocations, if required. If not specified, uses the default pool.
- Returns:
- count : int
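For instance, a minimal sketch assuming the hypothetical "example.parquet" file from the class-level sketch above:

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("example.parquet")  # hypothetical file from the sketch above
>>> dataset.count_rows()                     # all rows in the dataset
6
>>> dataset.count_rows(filter=ds.field("year") > 2020)  # only rows passing the filter
4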
- files#
List of the files
- filesystem#
- filter(self, expression)#
Apply a row filter to the dataset.
- Parameters:
- expression : Expression
  The filter that should be applied to the dataset.
- Returns:
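A minimal sketch, again assuming the hypothetical "example.parquet" file from the class-level sketch above; the filter is applied to every subsequent scan of the returned dataset:

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("example.parquet")       # hypothetical file from the sketch above
>>> filtered = dataset.filter(ds.field("n_legs") >= 4)
>>> filtered.to_table().num_rows                  # only the matching rows are read
4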
- format#
The FileFormat of this source.
- classmethod from_paths(cls, paths, schema=None, format=None, filesystem=None, partitions=None, root_partition=None)#
A Dataset created from a list of paths on a particular filesystem.
- Parameters:
- paths : list of str
  List of file paths to create the fragments from.
- schema : Schema
  The top-level schema of the Dataset.
- format : FileFormat
  File format to create fragments from; currently only ParquetFileFormat, IpcFileFormat, CsvFileFormat, and JsonFileFormat are supported.
- filesystem : FileSystem
  The filesystem which files are from.
- partitions : list[Expression], optional
  Attach additional partition information for the file paths.
- root_partition : Expression, optional
  The top-level partition of the Dataset.
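A sketch with hypothetical paths ("data/part-0.parquet", "data/part-1.parquet"); it assumes those Parquet files exist locally and share the given schema:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> from pyarrow import fs
>>> schema = pa.schema([("year", pa.int64()), ("n_legs", pa.int64())])
>>> dataset = ds.FileSystemDataset.from_paths(
...     ["data/part-0.parquet", "data/part-1.parquet"],  # hypothetical paths
...     schema=schema,
...     format=ds.ParquetFileFormat(),
...     filesystem=fs.LocalFileSystem(),
... )
>>> dataset.files
['data/part-0.parquet', 'data/part-1.parquet']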
- get_fragments(self, Expression filter=None)#
Returns an iterator over the fragments in this dataset.
- Parameters:
- filter : Expression, default None
  Return fragments matching the optional filter, either using the partition_expression or internal information like Parquet's statistics.
- Returns:
- fragments : iterator of Fragment
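For example, a sketch listing the file behind each fragment, assuming the hypothetical "example.parquet" file from the class-level sketch above:

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("example.parquet")  # hypothetical file from the sketch above
>>> for fragment in dataset.get_fragments():
...     print(fragment.path)                 # each FileFragment knows its source file
example.parquet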
- head(self, int num_rows, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)#
Load the first N rows of the dataset.
- Parameters:
- num_rows : int
  The number of rows to load.
- columns : list of str, default None
  The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.
  The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
  The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset's Schema.
- filter : Expression, default None
  Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size : int, default 131_072
  The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
- batch_readahead : int, default 16
  The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead : int, default 4
  The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options : FragmentScanOptions, default None
  Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threads : bool, default True
  If enabled, maximum parallelism will be used, as determined by the number of available CPU cores.
- cache_metadata : bool, default True
  If enabled, metadata may be cached when scanning to speed up repeated scans.
- memory_pool : MemoryPool, default None
  For memory allocations, if required. If not specified, uses the default pool.
- Returns:
- table : Table
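A minimal sketch, assuming the hypothetical "example.parquet" file from the class-level sketch above:

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("example.parquet")      # hypothetical file from the sketch above
>>> dataset.head(2, columns=["year"]).to_pydict()
{'year': [2020, 2022]}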
- join(self, right_dataset, keys, right_keys=None, join_type='left outer', left_suffix=None, right_suffix=None, coalesce_keys=True, use_threads=True)#
Perform a join between this dataset and another one.
Result of the join will be a new dataset, where further operations can be applied.
- Parameters:
- right_dataset : Dataset
  The dataset to join to the current one, acting as the right dataset in the join operation.
- keys : str or list[str]
  The columns from current dataset that should be used as keys of the join operation left side.
- right_keys : str or list[str], default None
  The columns from the right_dataset that should be used as keys on the join operation right side. When None use the same key names as the left dataset.
- join_type : str, default "left outer"
  The kind of join that should be performed, one of ("left semi", "right semi", "left anti", "right anti", "inner", "left outer", "right outer", "full outer").
- left_suffix : str, default None
  Which suffix to add to left column names. This prevents confusion when the columns in left and right datasets have colliding names.
- right_suffix : str, default None
  Which suffix to add to the right column names. This prevents confusion when the columns in left and right datasets have colliding names.
- coalesce_keys : bool, default True
  If the duplicated keys should be omitted from one of the sides in the join result.
- use_threads : bool, default True
  Whether to use multithreading or not.
- Returns:
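A sketch joining the hypothetical "example.parquet" dataset from the class-level sketch above against a small in-memory dataset on the "year" column:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("example.parquet")      # hypothetical file from the sketch above
>>> labels = ds.InMemoryDataset(pa.table({'year': [2019, 2020, 2021, 2022],
...                                       'label': ['a', 'b', 'c', 'd']}))
>>> joined = dataset.join(labels, keys="year")   # left outer join by default
>>> sorted(joined.to_table().column_names)
['label', 'n_legs', 'year']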
- join_asof(self, right_dataset, on, by, tolerance, right_on=None, right_by=None)#
Perform an asof join between this dataset and another one.
This is similar to a left-join except that we match on nearest key rather than equal keys. Both datasets must be sorted by the key. This type of join is most useful for time series data that are not perfectly aligned.
Optionally match on equivalent keys with “by” before searching with “on”.
Result of the join will be a new Dataset, where further operations can be applied.
- Parameters:
- right_dataset : Dataset
  The dataset to join to the current one, acting as the right dataset in the join operation.
- on : str
  The column from current dataset that should be used as the "on" key of the join operation left side.
  An inexact match is used on the "on" key, i.e. a row is considered a match if and only if right.on - left.on is in the range [min(0, tolerance), max(0, tolerance)]. The input table must be sorted by the "on" key. Must be a single field of a common type.
  Currently, the "on" key must be an integer, date, or timestamp type.
- by : str or list[str]
  The columns from current dataset that should be used as the keys of the join operation left side. The join operation is then done only for the matches in these columns.
- tolerance : int
  The tolerance for inexact "on" key matching. A right row is considered a match with a left row if right.on - left.on is in the range [min(0, tolerance), max(0, tolerance)]. tolerance may be:
  - negative, in which case a past-as-of-join occurs (match iff tolerance <= right.on - left.on <= 0);
  - positive, in which case a future-as-of-join occurs (match iff 0 <= right.on - left.on <= tolerance);
  - zero, in which case an exact-as-of-join occurs (match iff right.on == left.on).
  The tolerance is interpreted in the same units as the "on" key.
- right_on : str or list[str], default None
  The columns from the right_dataset that should be used as the on key on the join operation right side. When None use the same key name as the left dataset.
- right_by : str or list[str], default None
  The columns from the right_dataset that should be used as by keys on the join operation right side. When None use the same key names as the left dataset.
- Returns:
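A sketch over two small in-memory datasets (the names left and right are hypothetical); with a positive tolerance each left row matches the nearest right row with 0 <= right.t - left.t <= tolerance within the same "key" group:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> left = ds.InMemoryDataset(pa.table({'t': [1, 5, 10], 'key': ['a', 'a', 'a']}))
>>> right = ds.InMemoryDataset(pa.table({'t': [2, 6, 11], 'key': ['a', 'a', 'a'],
...                                      'value': [1.0, 2.0, 3.0]}))
>>> # With tolerance=1, the left row at t=1 matches the right row at t=2, and so on.
>>> joined = left.join_asof(right, on="t", by="key", tolerance=1)
>>> result = joined.to_table()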
- partition_expression#
An Expression which evaluates to true for all data viewed by this Dataset.
- partitioning#
The partitioning of the Dataset source, if discovered.
If the FileSystemDataset is created using the dataset() factory function with a partitioning specified, this will return the finalized Partitioning object from the dataset discovery. In all other cases, this returns None.
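A sketch of the discovery case, using a hypothetical "partitioned_data" directory written with a hive-style "year" partitioning:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> table = pa.table({'year': [2020, 2021], 'n_legs': [2, 4]})
>>> part = ds.partitioning(pa.schema([("year", pa.int64())]), flavor="hive")
>>> ds.write_dataset(table, "partitioned_data", format="parquet", partitioning=part)
>>> dataset = ds.dataset("partitioned_data", format="parquet", partitioning="hive")
>>> isinstance(dataset.partitioning, ds.HivePartitioning)
True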
- replace_schema(self, Schema schema)#
Return a copy of this Dataset with a different schema.
The copy will view the same Fragments. If the new schema is not compatible with the original dataset's schema then an error will be raised.
- Parameters:
- schema : Schema
  The new dataset schema.
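A sketch restricting the dataset's view to a subset of its fields, assuming the hypothetical "example.parquet" file from the class-level sketch above (dropping a field is one schema change that stays compatible):

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("example.parquet")      # hypothetical file from the sketch above
>>> narrower = dataset.replace_schema(pa.schema([("year", pa.int64())]))
>>> narrower.to_table().column_names
['year']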
- scanner(self, columns=None, filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)#
Build a scan operation against the dataset.
Data is not loaded immediately. Instead, this produces a Scanner, which exposes further operations (e.g. loading all data as a table, counting rows).
See the Scanner.from_dataset() method for further information.
- Parameters:
- columns : list of str, default None
  The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.
  The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
  The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset's Schema.
- filter : Expression, default None
  Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size : int, default 131_072
  The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
- batch_readahead : int, default 16
  The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead : int, default 4
  The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options : FragmentScanOptions, default None
  Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threads : bool, default True
  If enabled, maximum parallelism will be used, as determined by the number of available CPU cores.
- cache_metadata : bool, default True
  If enabled, metadata may be cached when scanning to speed up repeated scans.
- memory_pool : MemoryPool, default None
  For memory allocations, if required. If not specified, uses the default pool.
- Returns:
- scanner : Scanner
Examples
>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>>
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, "dataset_scanner.parquet")
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
Selecting a subset of the columns:
>>> dataset.scanner(columns=["year", "n_legs"]).to_table()
pyarrow.Table
year: int64
n_legs: int64
----
year: [[2020,2022,2021,2022,2019,2021]]
n_legs: [[2,2,4,4,5,100]]
Projecting selected columns using an expression:
>>> dataset.scanner(columns={
...     "n_legs_uint": ds.field("n_legs").cast("uint8"),
... }).to_table()
pyarrow.Table
n_legs_uint: uint8
----
n_legs_uint: [[2,2,4,4,5,100]]
Filtering rows while scanning:
>>> dataset.scanner(filter=ds.field("year") > 2020).to_table()
pyarrow.Table
year: int64
n_legs: int64
animal: string
----
year: [[2022,2021,2022,2021]]
n_legs: [[2,4,4,100]]
animal: [["Parrot","Dog","Horse","Centipede"]]
- schema#
The common schema of the full Dataset
- sort_by(self, sorting, **kwargs)#
Sort the Dataset by one or multiple columns.
- Parameters:
- Returns:
- InMemoryDataset
  A new dataset sorted according to the sort keys.
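A sketch, assuming the hypothetical "example.parquet" file from the class-level sketch above; sorting keys are given as (name, order) pairs:

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("example.parquet")      # hypothetical file from the sketch above
>>> sorted_ds = dataset.sort_by([("year", "ascending")])
>>> sorted_ds.to_table()["year"].to_pylist()
[2019, 2020, 2021, 2021, 2022, 2022]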
- take(self, indices, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)#
Select rows of data by index.
- Parameters:
- indices : Array or array-like
  Indices of rows to select in the dataset.
- columns : list of str, default None
  The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.
  The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
  The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset's Schema.
- filter : Expression, default None
  Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size : int, default 131_072
  The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
- batch_readahead : int, default 16
  The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead : int, default 4
  The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options : FragmentScanOptions, default None
  Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threads : bool, default True
  If enabled, maximum parallelism will be used, as determined by the number of available CPU cores.
- cache_metadata : bool, default True
  If enabled, metadata may be cached when scanning to speed up repeated scans.
- memory_pool : MemoryPool, default None
  For memory allocations, if required. If not specified, uses the default pool.
- Returns:
- table : Table
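A minimal sketch, assuming the hypothetical "example.parquet" file from the class-level sketch above:

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("example.parquet")      # hypothetical file from the sketch above
>>> dataset.take([0, 5], columns=["year", "n_legs"]).to_pydict()
{'year': [2020, 2021], 'n_legs': [2, 100]}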
- to_batches(self, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)#
Read the dataset as materialized record batches.
- Parameters:
- columns : list of str, default None
  The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.
  The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
  The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset's Schema.
- filter : Expression, default None
  Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size : int, default 131_072
  The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
- batch_readahead : int, default 16
  The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead : int, default 4
  The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options : FragmentScanOptions, default None
  Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threads : bool, default True
  If enabled, maximum parallelism will be used, as determined by the number of available CPU cores.
- cache_metadata : bool, default True
  If enabled, metadata may be cached when scanning to speed up repeated scans.
- memory_pool : MemoryPool, default None
  For memory allocations, if required. If not specified, uses the default pool.
- Returns:
- record_batches : iterator of RecordBatch
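A sketch iterating over the stream of record batches, assuming the hypothetical "example.parquet" file from the class-level sketch above:

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("example.parquet")      # hypothetical file from the sketch above
>>> total = 0
>>> for batch in dataset.to_batches(columns=["n_legs"]):
...     total += batch.num_rows                  # each item is a pyarrow.RecordBatch
>>> total
6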
- to_table(self, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)#
Read the dataset to an Arrow table.
Note that this method reads all the selected data from the dataset into memory.
- Parameters:
- columns : list of str, default None
  The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.
  The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
  The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset's Schema.
- filter : Expression, default None
  Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size : int, default 131_072
  The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
- batch_readahead : int, default 16
  The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead : int, default 4
  The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options : FragmentScanOptions, default None
  Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threads : bool, default True
  If enabled, maximum parallelism will be used, as determined by the number of available CPU cores.
- cache_metadata : bool, default True
  If enabled, metadata may be cached when scanning to speed up repeated scans.
- memory_pool : MemoryPool, default None
  For memory allocations, if required. If not specified, uses the default pool.
- Returns:
- table : Table
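A minimal sketch, assuming the hypothetical "example.parquet" file from the class-level sketch above:

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("example.parquet")      # hypothetical file from the sketch above
>>> dataset.to_table(columns=["year"], filter=ds.field("n_legs") == 2).to_pydict()
{'year': [2020, 2022]}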

