pyarrow.dataset.dataset
- pyarrow.dataset.dataset(source, schema=None, format=None, filesystem=None, partitioning=None, partition_base_dir=None, exclude_invalid_files=None, ignore_prefixes=None)
Open a dataset.
Datasets provide functionality to efficiently work with tabular, potentially larger-than-memory, multi-file datasets:
- A unified interface for different sources, like Parquet and Feather
- Discovery of sources (crawling directories, handling directory-based partitioned datasets, basic schema normalization)
- Optimized reading with predicate pushdown (filtering rows), projection (selecting columns), and parallel reading or fine-grained managing of tasks
Note that this is the high-level API; to have more control over the dataset construction, use the low-level API classes (FileSystemDataset, FileSystemDatasetFactory, etc.).
- Parameters:
- source : path, list of paths, Dataset, list of Datasets, (list of) RecordBatch or Table, iterable of RecordBatch, RecordBatchReader, or URI
- Path pointing to a single file:
Open a FileSystemDataset from a single file.
- Path pointing to a directory:
The directory gets discovered recursively according to a partitioning scheme if given.
- List of file paths:
Create a FileSystemDataset from explicitly given files. The files must be located on the same filesystem given by the filesystem parameter. Note that, contrary to construction from a single file, passing URIs as paths is not allowed.
- List of datasets:
A nested UnionDataset gets constructed; it allows arbitrary composition of other datasets. Note that additional keyword arguments are not allowed.
- (List of) batches or tables, iterable of batches, or RecordBatchReader:
Create an InMemoryDataset. If an iterable or empty list is given, a schema must also be given. If an iterable or RecordBatchReader is given, the resulting dataset can only be scanned once; further attempts will raise an error.
- schema : Schema, optional
Optionally provide the Schema for the Dataset, in which case it will not be inferred from the source.
- format : FileFormat or str
Currently “parquet”, “ipc”/”arrow”/”feather”, “csv”, “json”, and “orc” are supported. For Feather, only version 2 files are supported.
- filesystem : FileSystem or URI str, default None
If a single path is given as source and filesystem is None, then the filesystem will be inferred from the path. If a URI string is passed, then a filesystem object is constructed using the URI’s optional path component as a directory prefix. See the examples below. Note that the URIs on Windows must follow ‘file:///C:…’ or ‘file:/C:…’ patterns.
- partitioning : Partitioning, PartitioningFactory, str, list of str
The partitioning scheme specified with the partitioning() function. A flavor string can be used as a shortcut, and with a list of field names a DirectoryPartitioning will be inferred.
- partition_base_dir : str, optional
For the purposes of applying the partitioning, paths will be stripped of the partition_base_dir. Files not matching the partition_base_dir prefix will be skipped for partitioning discovery. The ignored files will still be part of the Dataset, but will not have partition information.
- exclude_invalid_files : bool, optional (default False)
If True, invalid files will be excluded (file format specific check). This will incur IO for each file in a serial and single-threaded fashion. Disabling this feature will skip the IO, but unsupported files may be present in the Dataset (resulting in an error at scan time).
- ignore_prefixes : list, optional
Files matching any of these prefixes will be ignored by the discovery process. This is matched against the basename of a path. By default this is [‘.’, ‘_’]. Note that discovery happens only if a directory is passed as source.
- Returns:
- dataset : Dataset
Either a FileSystemDataset or a UnionDataset depending on the source parameter.
Examples
Creating an example Table:
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> pq.write_table(table, "file.parquet")
Opening a single file:
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("file.parquet", format="parquet")
>>> dataset.to_table()
pyarrow.Table
year: int64
n_legs: int64
animal: string
----
year: [[2020,2022,2021,2022,2019,2021]]
n_legs: [[2,2,4,4,5,100]]
animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]
Opening a single file with an explicit schema:
>>> myschema = pa.schema([
...     ('n_legs', pa.int64()),
...     ('animal', pa.string())])
>>> dataset = ds.dataset("file.parquet", schema=myschema, format="parquet")
>>> dataset.to_table()
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[2,2,4,4,5,100]]
animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]
Opening a dataset for a single directory:
>>> ds.write_dataset(table, "partitioned_dataset", format="parquet",
...                  partitioning=['year'])
>>> dataset = ds.dataset("partitioned_dataset", format="parquet")
>>> dataset.to_table()
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[5],[2],[4,100],[2,4]]
animal: [["Brittle stars"],["Flamingo"],...["Parrot","Horse"]]
For a single directory from an S3 bucket:
>>> ds.dataset("s3://mybucket/nyc-taxi/",
...            format="parquet")
Opening a dataset from a list of relative local paths:
>>> dataset = ds.dataset([
...     "partitioned_dataset/2019/part-0.parquet",
...     "partitioned_dataset/2020/part-0.parquet",
...     "partitioned_dataset/2021/part-0.parquet",
... ], format='parquet')
>>> dataset.to_table()
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[5],[2],[4,100]]
animal: [["Brittle stars"],["Flamingo"],["Dog","Centipede"]]
With filesystem provided:
>>> paths = [
...     'part0/data.parquet',
...     'part1/data.parquet',
...     'part3/data.parquet',
... ]
>>> ds.dataset(paths, filesystem='file:///directory/prefix',
...            format='parquet')
Which is equivalent to:
>>> from pyarrow.fs import LocalFileSystem, SubTreeFileSystem
>>> fs = SubTreeFileSystem("/directory/prefix",
...                        LocalFileSystem())
>>> ds.dataset(paths, filesystem=fs, format='parquet')
With a remote filesystem URI:
>>> paths = [
...     'nested/directory/part0/data.parquet',
...     'nested/directory/part1/data.parquet',
...     'nested/directory/part3/data.parquet',
... ]
>>> ds.dataset(paths, filesystem='s3://bucket/',
...            format='parquet')
Similarly to the local example, the directory prefix may be included in the filesystem URI:
>>> ds.dataset(paths, filesystem='s3://bucket/nested/directory',
...            format='parquet')
Construction of a nested dataset:
>>> ds.dataset([
...     ds.dataset("s3://old-taxi-data", format="parquet"),
...     ds.dataset("local/path/to/data", format="ipc")
... ])

