pyarrow.dataset.dataset#

pyarrow.dataset.dataset(source, schema=None, format=None, filesystem=None, partitioning=None, partition_base_dir=None, exclude_invalid_files=None, ignore_prefixes=None)[source]#

Open a dataset.

Datasets provide functionality to efficiently work with tabular, potentially larger-than-memory, multi-file datasets:

  • A unified interface for different sources, like Parquet and Feather

  • Discovery of sources (crawling directories, handling directory-based partitioned datasets, basic schema normalization)

  • Optimized reading with predicate pushdown (filtering rows), projection (selecting columns), parallel reading, and fine-grained management of tasks.

Note that this is the high-level API; for more control over the dataset construction, use the low-level API classes (FileSystemDataset, FileSystemDatasetFactory, etc.)

Parameters:
source : path, list of paths, dataset, list of datasets, (list of) RecordBatch or Table, iterable of RecordBatch, RecordBatchReader, or URI
Path pointing to a single file:

Open a FileSystemDataset from a single file.

Path pointing to a directory:

The directory gets discovered recursively, according to a partitioning scheme if given.

List of file paths:

Create a FileSystemDataset from explicitly given files. The files must be located on the same filesystem given by the filesystem parameter. Note that, contrary to construction from a single file, passing URIs as paths is not allowed.

List of datasets:

A nested UnionDataset gets constructed; it allows arbitrary composition of other datasets. Note that additional keyword arguments are not allowed.

(List of) batches or tables, iterable of batches, or RecordBatchReader:

Create an InMemoryDataset. If an iterable or empty list is given, a schema must also be given. If an iterable or RecordBatchReader is given, the resulting dataset can only be scanned once; further attempts will raise an error.

schema : Schema, optional

Optionally provide the Schema for the Dataset, in which case it will not be inferred from the source.

format : FileFormat or str

Currently “parquet”, “ipc”/“arrow”/“feather”, “csv”, “json”, and “orc” are supported. For Feather, only version 2 files are supported.

filesystem : FileSystem or URI str, default None

If a single path is given as source and filesystem is None, then the filesystem will be inferred from the path. If a URI string is passed, then a filesystem object is constructed using the URI’s optional path component as a directory prefix. See the examples below. Note that URIs on Windows must follow the ‘file:///C:…’ or ‘file:/C:…’ patterns.

partitioning : Partitioning, PartitioningFactory, str, or list of str

The partitioning scheme specified with the partitioning() function. A flavor string can be used as a shortcut, and with a list of field names a DirectoryPartitioning will be inferred.

partition_base_dir : str, optional

For the purposes of applying the partitioning, paths will be stripped of the partition_base_dir. Files not matching the partition_base_dir prefix will be skipped for partitioning discovery. The ignored files will still be part of the Dataset, but will not have partition information.

exclude_invalid_files : bool, optional (default False)

If True, invalid files will be excluded (a file-format-specific check). This will incur IO for each file in a serial and single-threaded fashion. Disabling this feature will skip the IO, but unsupported files may be present in the Dataset (resulting in an error at scan time).

ignore_prefixes : list, optional

Files matching any of these prefixes will be ignored by the discovery process. This is matched against the basename of a path. By default this is [‘.’, ‘_’]. Note that discovery happens only if a directory is passed as source.

Returns:
dataset : Dataset

Either a FileSystemDataset or a UnionDataset depending on the source parameter.

Examples

Creating an example Table:

>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> pq.write_table(table, "file.parquet")

Opening a single file:

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("file.parquet", format="parquet")
>>> dataset.to_table()
pyarrow.Table
year: int64
n_legs: int64
animal: string
----
year: [[2020,2022,2021,2022,2019,2021]]
n_legs: [[2,2,4,4,5,100]]
animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]

Opening a single file with an explicit schema:

>>> myschema = pa.schema([
...     ('n_legs', pa.int64()),
...     ('animal', pa.string())])
>>> dataset = ds.dataset("file.parquet", schema=myschema, format="parquet")
>>> dataset.to_table()
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[2,2,4,4,5,100]]
animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]

Opening a dataset for a single directory:

>>> ds.write_dataset(table, "partitioned_dataset", format="parquet",
...                  partitioning=['year'])
>>> dataset = ds.dataset("partitioned_dataset", format="parquet")
>>> dataset.to_table()
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[5],[2],[4,100],[2,4]]
animal: [["Brittle stars"],["Flamingo"],...["Parrot","Horse"]]

For a single directory from an S3 bucket:

>>> ds.dataset("s3://mybucket/nyc-taxi/",
...            format="parquet")

Opening a dataset from a list of relative local paths:

>>> dataset = ds.dataset([
...     "partitioned_dataset/2019/part-0.parquet",
...     "partitioned_dataset/2020/part-0.parquet",
...     "partitioned_dataset/2021/part-0.parquet",
... ], format='parquet')
>>> dataset.to_table()
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[5],[2],[4,100]]
animal: [["Brittle stars"],["Flamingo"],["Dog","Centipede"]]

With filesystem provided:

>>> paths = [
...     'part0/data.parquet',
...     'part1/data.parquet',
...     'part3/data.parquet',
... ]
>>> ds.dataset(paths, filesystem='file:///directory/prefix',
...            format='parquet')

Which is equivalent to:

>>> from pyarrow.fs import SubTreeFileSystem, LocalFileSystem
>>> fs = SubTreeFileSystem("/directory/prefix",
...                        LocalFileSystem())
>>> ds.dataset(paths, filesystem=fs, format='parquet')

With a remote filesystem URI:

>>> paths = [
...     'nested/directory/part0/data.parquet',
...     'nested/directory/part1/data.parquet',
...     'nested/directory/part3/data.parquet',
... ]
>>> ds.dataset(paths, filesystem='s3://bucket/',
...            format='parquet')

Similarly to the local example, the directory prefix may be included in the filesystem URI:

>>> ds.dataset(paths, filesystem='s3://bucket/nested/directory',
...            format='parquet')

Construction of a nested dataset:

>>> ds.dataset([
...     ds.dataset("s3://old-taxi-data", format="parquet"),
...     ds.dataset("local/path/to/data", format="ipc")
... ])