Developer#

This section will focus on downstream applications of pandas.

Storing pandas DataFrame objects in Apache Parquet format#

The Apache Parquet format provides key-value metadata at the file and column level, stored in the footer of the Parquet file:

5: optional list<KeyValue> key_value_metadata

where KeyValue is

struct KeyValue {
  1: required string key
  2: optional string value
}

So that a pandas.DataFrame can be faithfully reconstructed, we store a pandas metadata key in the FileMetaData with the value stored as:

{'index_columns': [<descr0>, <descr1>, ...],
 'column_indexes': [<ci0>, <ci1>, ..., <ciN>],
 'columns': [<c0>, <c1>, ...],
 'pandas_version': $VERSION,
 'creator': {
   'library': $LIBRARY,
   'version': $LIBRARY_VERSION
 }}

The “descriptor” values <descr0> in the 'index_columns' field are strings (referring to a column) or dictionaries with values as described below.

The <c0>/<ci0> and so forth are dictionaries containing the metadata for each column, including the index columns. This has JSON form:

{'name': column_name,
 'field_name': parquet_column_name,
 'pandas_type': pandas_type,
 'numpy_type': numpy_type,
 'metadata': metadata}

See below for the detailed specification for these.
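As a minimal sketch of how a reader might recover this structure, the snippet below assumes the value under the pandas key is serialized as a JSON string. The names `file_kv` and `load_pandas_metadata` are hypothetical, and `file_kv` is only a stand-in for the file-level key-value metadata that a Parquet library would expose.

```python
import json

# Illustrative metadata following the layout described above.
pandas_metadata = {
    "index_columns": ["__index_level_0__"],
    "column_indexes": [],
    "columns": [
        {"name": "c0", "field_name": "c0", "pandas_type": "int8",
         "numpy_type": "int8", "metadata": None},
    ],
    "pandas_version": "1.4.0",
    "creator": {"library": "pyarrow", "version": "0.13.0"},
}

# Hypothetical stand-in for the footer's key_value_metadata: a flat mapping
# of string keys to string values (assumption: the value is JSON text).
file_kv = {"pandas": json.dumps(pandas_metadata)}


def load_pandas_metadata(key_value_metadata):
    """Return the decoded pandas metadata, or None if the key is absent."""
    raw = key_value_metadata.get("pandas")
    return None if raw is None else json.loads(raw)


decoded = load_pandas_metadata(file_kv)
```

Returning None for a missing key lets callers treat files written by non-pandas producers as having no pandas-specific information.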

Index metadata descriptors#

RangeIndex can be stored as metadata only, not requiring serialization. The descriptor format for these is as follows:

index = pd.RangeIndex(0, 10, 2)
{
    "kind": "range",
    "name": index.name,
    "start": index.start,
    "stop": index.stop,
    "step": index.step,
}
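Because a range descriptor carries everything needed to regenerate the index values, a reader can rebuild them without touching any data column. The sketch below uses Python's built-in range as a stand-in for pd.RangeIndex so that it has no dependencies; the helper names `range_descriptor` and `range_from_descriptor` are hypothetical.

```python
def range_descriptor(r, name=None):
    """Build a 'range' descriptor from a Python range (stand-in for RangeIndex)."""
    return {
        "kind": "range",
        "name": name,
        "start": r.start,
        "stop": r.stop,
        "step": r.step,
    }


def range_from_descriptor(descr):
    """Rebuild the range described by a {'kind': 'range', ...} descriptor."""
    if not isinstance(descr, dict) or descr.get("kind") != "range":
        raise ValueError("not a range descriptor")
    return range(descr["start"], descr["stop"], descr["step"])


descr = range_descriptor(range(0, 10, 2))
values = list(range_from_descriptor(descr))
```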

Other index types must be serialized as data columns along with the other DataFrame columns. The metadata for these is a string indicating the name of the field in the data columns, for example '__index_level_0__'.

If an index has a non-None name attribute, and there is no other column with a name matching that value, then the index.name value can be used as the descriptor. Otherwise (for unnamed indexes and ones with names colliding with other column names) a disambiguating name with pattern matching __index_level_\d+__ should be used. In cases of named indexes as data columns, the name attribute is always stored in the column descriptors as above.
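The naming rule for serialized index levels can be sketched as a small function. This is an illustration rather than the logic of any particular writer: `index_field_name` and its parameters are hypothetical, and index names are assumed to be strings.

```python
import re

# Pattern for the disambiguating names described above.
INDEX_LEVEL_PATTERN = re.compile(r"__index_level_\d+__")


def index_field_name(index_name, level, column_names):
    """Pick the descriptor string for a serialized index level.

    Assumes index_name is either None or a string.
    """
    # A named index whose name does not collide with a column keeps its name.
    if index_name is not None and index_name not in column_names:
        return index_name
    # Unnamed or colliding indexes get a positional placeholder name.
    return f"__index_level_{level}__"


columns = ["a", "b"]
named = index_field_name("id", 0, columns)
unnamed = index_field_name(None, 0, columns)
colliding = index_field_name("a", 1, columns)
```

In the colliding case the original name is not lost, since it is still recorded in the 'name' field of the corresponding column descriptor.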

Column metadata#

pandas_type is the logical type of the column, and is one of:

  • Boolean: 'bool'

  • Integers: 'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'

  • Floats: 'float16', 'float32', 'float64'

  • Date and Time Types: 'datetime', 'datetimetz', 'timedelta'

  • String: 'unicode', 'bytes'

  • Categorical: 'categorical'

  • Other Python objects: 'object'

The numpy_type is the physical storage type of the column, which is the result of str(dtype) for the underlying NumPy array that holds the data. So for datetimetz this is datetime64[ns] and for categorical, it may be any of the supported integer categorical types.

The metadata field is None except for:

  • datetimetz: {'timezone': zone, 'unit': 'ns'}, e.g. {'timezone': 'America/New_York', 'unit': 'ns'}. The 'unit' is optional, and if omitted it is assumed to be nanoseconds.

  • categorical: {'num_categories': K, 'ordered': is_ordered, 'type': $TYPE}

    • Here 'type' is optional, and can be a nested pandas type specification (but not categorical)

  • unicode: {'encoding': encoding}

    • The encoding is optional, and if not present is UTF-8

  • object: {'encoding': encoding}. Objects can be serialized and stored in BYTE_ARRAY Parquet columns. The encoding can be one of:

    • 'pickle'

    • 'bson'

    • 'json'

  • timedelta: {'unit': 'ns'}. The 'unit' is optional, and if omitted it is assumed to be nanoseconds. This metadata is optional altogether.

For types other than these, the 'metadata' key can be omitted. Implementations can assume None if the key is not present.
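The defaults described in this list can be applied in one place when reading column entries. The following is a sketch under stated assumptions, not a reference implementation: `resolve_metadata` is a hypothetical helper, and it assumes that a unicode column with no metadata at all should also default to UTF-8.

```python
def resolve_metadata(column):
    """Return a column's metadata with the documented defaults filled in."""
    pandas_type = column["pandas_type"]
    # A missing 'metadata' key is treated the same as None.
    metadata = column.get("metadata")

    if pandas_type in ("datetimetz", "timedelta"):
        resolved = dict(metadata or {})
        resolved.setdefault("unit", "ns")  # unit defaults to nanoseconds
        return resolved
    if pandas_type == "unicode":
        resolved = dict(metadata or {})
        resolved.setdefault("encoding", "UTF-8")  # encoding defaults to UTF-8
        return resolved
    return metadata


tz_meta = resolve_metadata(
    {"pandas_type": "datetimetz",
     "metadata": {"timezone": "America/Los_Angeles"}}
)
td_meta = resolve_metadata({"pandas_type": "timedelta"})
int_meta = resolve_metadata({"pandas_type": "int8"})
```

Copying the dictionary before filling defaults keeps the decoded file metadata unchanged, which matters if it is later written back out.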

As an example of fully-formed metadata:

{'index_columns': ['__index_level_0__'],
 'column_indexes': [
     {'name': None,
      'field_name': 'None',
      'pandas_type': 'unicode',
      'numpy_type': 'object',
      'metadata': {'encoding': 'UTF-8'}}
 ],
 'columns': [
     {'name': 'c0',
      'field_name': 'c0',
      'pandas_type': 'int8',
      'numpy_type': 'int8',
      'metadata': None},
     {'name': 'c1',
      'field_name': 'c1',
      'pandas_type': 'bytes',
      'numpy_type': 'object',
      'metadata': None},
     {'name': 'c2',
      'field_name': 'c2',
      'pandas_type': 'categorical',
      'numpy_type': 'int16',
      'metadata': {'num_categories': 1000, 'ordered': False}},
     {'name': 'c3',
      'field_name': 'c3',
      'pandas_type': 'datetimetz',
      'numpy_type': 'datetime64[ns]',
      'metadata': {'timezone': 'America/Los_Angeles'}},
     {'name': 'c4',
      'field_name': 'c4',
      'pandas_type': 'object',
      'numpy_type': 'object',
      'metadata': {'encoding': 'pickle'}},
     {'name': None,
      'field_name': '__index_level_0__',
      'pandas_type': 'int64',
      'numpy_type': 'int64',
      'metadata': None}
 ],
 'pandas_version': '1.4.0',
 'creator': {
   'library': 'pyarrow',
   'version': '0.13.0'
 }}
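Since 'columns' lists index columns alongside data columns, a reader typically needs to separate the two. A minimal sketch follows, using a trimmed copy of the example metadata; `split_columns` is a hypothetical helper, and it matches string descriptors in 'index_columns' against each entry's 'field_name'.

```python
def split_columns(pandas_metadata):
    """Separate data-column entries from serialized index entries."""
    # Only string descriptors refer to serialized columns; dictionary
    # descriptors (such as range indexes) have no column of their own.
    index_fields = {
        d for d in pandas_metadata["index_columns"] if isinstance(d, str)
    }
    data, index = [], []
    for col in pandas_metadata["columns"]:
        target = index if col["field_name"] in index_fields else data
        target.append(col)
    return data, index


# Trimmed version of the fully-formed example above.
example = {
    "index_columns": ["__index_level_0__"],
    "columns": [
        {"name": "c0", "field_name": "c0", "pandas_type": "int8",
         "numpy_type": "int8", "metadata": None},
        {"name": "c3", "field_name": "c3", "pandas_type": "datetimetz",
         "numpy_type": "datetime64[ns]",
         "metadata": {"timezone": "America/Los_Angeles"}},
        {"name": None, "field_name": "__index_level_0__",
         "pandas_type": "int64", "numpy_type": "int64", "metadata": None},
    ],
}

data_cols, index_cols = split_columns(example)
```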