pyarrow.dataset.DirectoryPartitioning

class pyarrow.dataset.DirectoryPartitioning(Schema schema, dictionaries=None, segment_encoding='uri')

Bases: KeyValuePartitioning

A Partitioning based on a specified Schema.

The DirectoryPartitioning expects one segment in the file path for each field in the schema (all fields are required to be present). For example, given schema<year:int16, month:int8>, the path “/2009/11” would be parsed to (“year” == 2009 and “month” == 11).

Parameters:
schema : Schema

The schema that describes the partitions present in the file path.

dictionaries : dict[str, Array]

If the type of any field of schema is a dictionary type, the corresponding entry of dictionaries must be an array containing every value which may be taken by the corresponding column, or an error will be raised in parsing.

segment_encoding : str, default “uri”

After splitting paths into segments, decode the segments. Valid values are “uri” (URI-decode segments) and “none” (leave as-is).

Returns:
DirectoryPartitioning

Examples

>>> import pyarrow as pa
>>> from pyarrow.dataset import DirectoryPartitioning
>>> partitioning = DirectoryPartitioning(
...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
>>> print(partitioning.parse("/2009/11/"))
((year == 2009) and (month == 11))

__init__(*args, **kwargs)

Methods

__init__(*args, **kwargs)

discover([field_names, infer_dictionary, ...])

Discover a DirectoryPartitioning.

format(self, expr)

Convert a filter expression into a tuple of (directory, filename) using the current partitioning scheme.

parse(self, path)

Parse a path into a partition expression.

Attributes

dictionaries

The unique values for each partition field, if available.

schema

The arrow Schema attached to the partitioning.

dictionaries

The unique values for each partition field, if available.

Those values are only available if the Partitioning object was created through dataset discovery from a PartitioningFactory, or if the dictionaries were manually specified in the constructor. If no dictionary field is available, this returns an empty list.

static discover(field_names=None, infer_dictionary=False, max_partition_dictionary_size=0, schema=None, segment_encoding='uri')

Discover a DirectoryPartitioning.

Parameters:
field_names : list of str

The names to associate with the values from the subdirectory names. If schema is given, will be populated from the schema.

infer_dictionary : bool, default False

When inferring a schema for partition fields, yield dictionary encoded types instead of plain types. This can be more efficient when materializing virtual columns, and Expressions parsed by the finished Partitioning will include dictionaries of all unique inspected values for each field.

max_partition_dictionary_size : int, default 0

Synonymous with infer_dictionary for backwards compatibility with 1.0: setting this to -1 or None is equivalent to passing infer_dictionary=True.

schema : Schema, default None

Use this schema instead of inferring a schema from partition values. Partition values will be validated against this schema before accumulation into the Partitioning’s dictionary.

segment_encoding : str, default “uri”

After splitting paths into segments, decode the segments. Valid values are “uri” (URI-decode segments) and “none” (leave as-is).

Returns:
PartitioningFactory

To be used in the FileSystemFactoryOptions.

format(self, expr)

Convert a filter expression into a tuple of (directory, filename) using the current partitioning scheme.

Parameters:
expr : pyarrow.dataset.Expression
Returns:
tuple[str, str]

Examples

Specify the Schema for paths like “/2009/June”:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.compute as pc
>>> part = ds.partitioning(pa.schema([("year", pa.int16()),
...                                   ("month", pa.string())]))
>>> part.format(
...     (pc.field("year") == 1862) & (pc.field("month") == "Jan")
... )
('1862/Jan', '')

parse(self, path)

Parse a path into a partition expression.

Parameters:
path : str
Returns:
pyarrow.dataset.Expression
schema

The arrow Schema attached to the partitioning.