pyarrow.dataset.HivePartitioning

class pyarrow.dataset.HivePartitioning(Schema schema, dictionaries=None, null_fallback='__HIVE_DEFAULT_PARTITION__', segment_encoding='uri')

Bases: KeyValuePartitioning

A Partitioning for “/$key=$value/” nested directories as found in Apache Hive.

Multi-level, directory based partitioning scheme originating from Apache Hive with all data files stored in the leaf directories. Data is partitioned by static values of a particular column in the schema. Partition keys are represented in the form $key=$value in directory names. Field order is ignored, as are missing or unrecognized field names.

For example, given schema <year:int16, month:int8, day:int8>, a possible path would be “/year=2009/month=11/day=15”.

Parameters:
schema : Schema

The schema that describes the partitions present in the file path.

dictionaries : dict[str, Array]

If the type of any field of schema is a dictionary type, the corresponding entry of dictionaries must be an array containing every value which may be taken by the corresponding column or an error will be raised in parsing.

null_fallback : str, default “__HIVE_DEFAULT_PARTITION__”

If any field is None then this fallback will be used as a label.

segment_encoding : str, default “uri”

After splitting paths into segments, decode the segments. Valid values are “uri” (URI-decode segments) and “none” (leave as-is).

Returns:
HivePartitioning

Examples

>>> from pyarrow.dataset import HivePartitioning
>>> partitioning = HivePartitioning(
...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
>>> print(partitioning.parse("/year=2009/month=11/"))
((year == 2009) and (month == 11))
__init__(*args, **kwargs)

Methods

__init__(*args, **kwargs)

discover([infer_dictionary, ...])

Discover a HivePartitioning.

format(self, expr)

Convert a filter expression into a tuple of (directory, filename) using the current partitioning scheme.

parse(self, path)

Parse a path into a partition expression.

Attributes

dictionaries

The unique values for each partition field, if available.

schema

The arrow Schema attached to the partitioning.

dictionaries

The unique values for each partition field, if available.

Those values are only available if the Partitioning object was created through dataset discovery from a PartitioningFactory, or if the dictionaries were manually specified in the constructor. If no dictionary field is available, this returns an empty list.

static discover(infer_dictionary=False, max_partition_dictionary_size=0, null_fallback='__HIVE_DEFAULT_PARTITION__', schema=None, segment_encoding='uri')

Discover a HivePartitioning.

Parameters:
infer_dictionary : bool, default False

When inferring a schema for partition fields, yield dictionary encoded types instead of plain. This can be more efficient when materializing virtual columns, and Expressions parsed by the finished Partitioning will include dictionaries of all unique inspected values for each field.

max_partition_dictionary_size : int, default 0

Synonymous with infer_dictionary for backwards compatibility with 1.0: setting this to -1 or None is equivalent to passing infer_dictionary=True.

null_fallback : str, default “__HIVE_DEFAULT_PARTITION__”

When inferring a schema for partition fields this value will be replaced by null. The default is set to __HIVE_DEFAULT_PARTITION__ for compatibility with Spark.

schema : Schema, default None

Use this schema instead of inferring a schema from partition values. Partition values will be validated against this schema before accumulation into the Partitioning’s dictionary.

segment_encoding : str, default “uri”

After splitting paths into segments, decode the segments. Valid values are “uri” (URI-decode segments) and “none” (leave as-is).

Returns:
PartitioningFactory

To be used in the FileSystemFactoryOptions.

format(self, expr)

Convert a filter expression into a tuple of (directory, filename) using the current partitioning scheme.

Parameters:
expr : pyarrow.dataset.Expression
Returns:
tuple[str, str]

Examples

Specify the Schema for paths like “/2009/June”:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.compute as pc
>>> part = ds.partitioning(pa.schema([("year", pa.int16()),
...                                   ("month", pa.string())]))
>>> part.format(
...     (pc.field("year") == 1862) & (pc.field("month") == "Jan")
... )
('1862/Jan', '')
parse(self, path)

Parse a path into a partition expression.

Parameters:
path : str
Returns:
pyarrow.dataset.Expression
schema

The arrow Schema attached to the partitioning.