Canonical Extension Types #

Introduction#

The Arrow columnar format allows defining extension types so as to extend standard Arrow data types with custom semantics. Often these semantics will be specific to a system or application. However, it is beneficial to share the definitions of well-known extension types so as to improve interoperability between different systems integrating Arrow columnar data.

Standardization#

These rules must be followed for the standardization of canonical extension types:

- Canonical extension types are described and maintained below in this document.
- Each canonical extension type requires a distinct discussion and vote on the Arrow development mailing list.
- The specification text to be added must follow these requirements:
  1. It must define a well-defined extension name starting with “arrow.”.
  2. Its parameters, if any, must be described in the proposal.
  3. Its serialization must be described in the proposal and should not require undue implementation work or unusual software dependencies (for example, a trivial custom text format or a JSON-based format would be acceptable).
  4. Its expected semantics should be described as well and any potential ambiguities or pain points addressed or at least mentioned.
- The extension type should have one implementation submitted; preferably two if non-trivial (for example if parameterized).

Making Modifications#

Like standard Arrow data types, canonical extension types should be considered stable once standardized. Modifying a canonical extension type (for example to expand the set of parameters) should be an exceptional event, follow the same rules as laid out above, and provide backwards compatibility guarantees.

Official List#

Fixed shape tensor#

Extension name: arrow.fixed_shape_tensor.
The storage type of the extension: FixedSizeList where:
- value_type is the data type of individual tensor elements.
- list_size is the product of all the elements in tensor shape.
Extension type parameters:
- value_type = the Arrow data type of individual tensor elements.
- shape = the physical shape of the contained tensors as an array.
Optional parameters describing the logical layout:
- dim_names = explicit names of the tensor dimensions as an array. Its length should be equal to the length of shape, that is, to the number of dimensions.
  dim_names can be used if the dimensions have well-known names and they map to the physical layout (row-major).
- permutation = indices of the desired ordering of the original dimensions, defined as an array.
  The indices contain a permutation of the values [0, 1, .., N-1] where N is the number of dimensions. The permutation indicates which dimension of the logical layout corresponds to which dimension of the physical tensor (the i-th dimension of the logical view corresponds to the dimension with number permutation[i] of the physical tensor).
  Permutation can be useful in case the logical order of the tensor is a permutation of the physical order (row-major).
  When logical and physical layout are equal, the permutation will always be [0, 1, .., N-1] and can therefore be left out.
Description of the serialization:
The metadata must be a valid JSON object including the shape of the contained tensors as an array with key “shape”, plus optional dimension names with key “dim_names” and ordering of the dimensions with key “permutation”.
- Example: {"shape":[2,5]}
- Example with dim_names metadata for NCHW ordered data:
  {"shape":[100,200,500],"dim_names":["C","H","W"]}
- Example of permuted 3-dimensional tensor:
  {"shape":[100,200,500],"permutation":[2,0,1]}
  This is the physical layout shape; the shape of the logical layout would in this case be [500,100,200].
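
The relationship between the serialized metadata, the storage list_size, and the logical layout can be sketched in a few lines of Python. This is only an illustration using the standard library: the function name describe_fixed_shape_tensor and the keys of the dictionary it returns are hypothetical and not part of any Arrow API.

```python
import json
from math import prod


def describe_fixed_shape_tensor(metadata: str) -> dict:
    """Derive storage and logical-layout facts from the extension metadata.

    Hypothetical helper for illustration; not an Arrow library function.
    """
    meta = json.loads(metadata)
    shape = meta["shape"]  # physical shape, always present
    ndim = len(shape)

    # A missing permutation means logical and physical layouts coincide.
    permutation = meta.get("permutation", list(range(ndim)))
    if sorted(permutation) != list(range(ndim)):
        raise ValueError("permutation must rearrange [0, 1, .., N-1]")

    dim_names = meta.get("dim_names")
    if dim_names is not None and len(dim_names) != ndim:
        raise ValueError("dim_names must have one entry per dimension")

    return {
        # list_size of the FixedSizeList storage is the product of the shape.
        "list_size": prod(shape),
        # The i-th logical dimension is physical dimension permutation[i].
        "logical_shape": [shape[p] for p in permutation],
        "logical_dim_names": (
            [dim_names[p] for p in permutation] if dim_names is not None else None
        ),
    }


info = describe_fixed_shape_tensor('{"shape":[100,200,500],"permutation":[2,0,1]}')
print(info["list_size"], info["logical_shape"])
```

For the permuted example above, the sketch reports a logical shape of [500, 100, 200], matching the text.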

Note

Elements in a fixed shape tensor extension array are stored in row-major/C-contiguous order.

Note

Other data structures in Arrow include a Tensor (Multi-dimensional Array), to be used as a message in the interprocess communication (IPC) machinery.

This structure has no relationship with the Fixed shape tensor extension type defined by this specification. Instead, this extension type lets one use fixed shape tensors as elements in a field of a RecordBatch or a Table.

Variable shape tensor#

Extension name: arrow.variable_shape_tensor.
The storage type of the extension is: StructArray where struct is composed of data and shape fields describing a single tensor per row:
- data is a List holding tensor elements (each list element is a single tensor). The List’s value type is the value type of the tensor, such as an integer or floating-point type.
- shape is a FixedSizeList<int32>[ndim] of the tensor shape where the size of the list ndim is equal to the number of dimensions of the tensor.
Extension type parameters:
- value_type = the Arrow data type of individual tensor elements.
Optional parameters describing the logical layout:
- dim_names = explicit names of the tensor dimensions as an array. Its length should be equal to the length of shape, that is, to the number of dimensions.
  dim_names can be used if the dimensions have well-known names and they map to the physical layout (row-major).
- permutation = indices of the desired ordering of the original dimensions, defined as an array.
  The indices contain a permutation of the values [0, 1, .., N-1] where N is the number of dimensions. The permutation indicates which dimension of the logical layout corresponds to which dimension of the physical tensor (the i-th dimension of the logical view corresponds to the dimension with number permutation[i] of the physical tensor).
  Permutation can be useful in case the logical order of the tensor is a permutation of the physical order (row-major).
  When logical and physical layout are equal, the permutation will always be [0, 1, .., N-1] and can therefore be left out.
- uniform_shape = sizes of individual tensor’s dimensions which are guaranteed to stay constant in uniform dimensions and can vary in non-uniform dimensions. This holds over all tensors in the array.
  Sizes in uniform dimensions are represented with int32 values, while sizes of the non-uniform dimensions are not known in advance and are represented with null. If uniform_shape is not provided it is assumed that all dimensions are non-uniform.
  An array containing a tensor with shape (2, 3, 4) and whose first and last dimensions are uniform would have uniform_shape (2, null, 4).
  This allows for interpreting the tensor correctly without accounting for uniform dimensions while still permitting optional optimizations that take advantage of the uniformity.
Description of the serialization:
The metadata must be a valid JSON object that optionally includes dimension names with key “dim_names” and ordering of dimensions with key “permutation”. Shapes of tensors can be defined in a subset of dimensions by providing key “uniform_shape”. Minimal metadata is an empty string.
- Example with dim_names metadata for NCHW ordered data (note that the first logical dimension, N, is mapped to the data List array: each element in the List is a CHW tensor and the List of tensors implicitly constitutes a single NCHW tensor):
  {"dim_names":["C","H","W"]}
- Example with uniform_shape metadata for a set of color images with fixed height, variable width and three color channels:
  {"dim_names":["H","W","C"],"uniform_shape":[400,null,3]}
- Example of permuted 3-dimensional tensor:
  {"permutation":[2,0,1]}
  For example, if the physical shape of an individual tensor is [100,200,500], this permutation would denote a logical shape of [500,100,200].

Note

With the exception of permutation, the parameters and storage of VariableShapeTensor relate to the physical storage of the tensor.

For example, consider a tensor with:

    shape = [10, 20, 30]
    dim_names = [x, y, z]
    permutation = [2, 0, 1]

This means the logical tensor has names [z, x, y] and shape [30, 10, 20].

Note

Values inside each data tensor element are stored in row-major/C-contiguous order according to the corresponding shape.
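
As a rough illustration of how uniform_shape constrains the per-row shapes, the following Python sketch checks one tensor's physical shape against the metadata. It uses only the standard library; the function name conforms_to_uniform_shape is hypothetical and not part of any Arrow API.

```python
import json


def conforms_to_uniform_shape(shape, metadata: str) -> bool:
    """Check one tensor's physical shape against the uniform_shape metadata.

    Hypothetical helper for illustration; not an Arrow library function.
    """
    # Minimal metadata is an empty string, which carries no constraints.
    meta = json.loads(metadata) if metadata else {}
    uniform_shape = meta.get("uniform_shape")
    if uniform_shape is None:
        # Without uniform_shape, all dimensions are assumed non-uniform.
        return True
    if len(uniform_shape) != len(shape):
        return False
    # null (None) marks a non-uniform dimension that may take any size.
    return all(u is None or u == s for u, s in zip(uniform_shape, shape))


images = '{"dim_names":["H","W","C"],"uniform_shape":[400,null,3]}'
print(conforms_to_uniform_shape([400, 640, 3], images))
print(conforms_to_uniform_shape([480, 640, 3], images))
```

With the color image metadata from the examples above, a 400 x 640 x 3 image conforms while a 480 x 640 x 3 image does not, because the height is declared uniform.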

JSON#

Extension name: arrow.json.
The storage type of this extension is String or LargeString or StringView. Only UTF-8 encoded JSON as specified in RFC 8259 is supported.
Extension type parameters:
This type does not have any parameters.
Description of the serialization:
Metadata is either an empty string or a JSON string with an empty object. In the future, additional fields may be added, but they are not required to interpret the array.
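
A lenient reader for this type might look like the following Python sketch, which uses only the standard library. Both function names are hypothetical. Accepting any JSON object as metadata is an assumption made here so that fields added in the future are tolerated. Python's json module accepts NaN and Infinity by default, which RFC 8259 does not allow, so the sketch rejects them explicitly.

```python
import json


def _reject_constant(name):
    # json.loads calls this hook for NaN, Infinity and -Infinity.
    raise ValueError(f"{name} is not valid JSON per RFC 8259")


def is_valid_json_value(text: str) -> bool:
    """Return True if one stored string parses as an RFC 8259 JSON text."""
    try:
        json.loads(text, parse_constant=_reject_constant)
    except ValueError:
        return False
    return True


def is_acceptable_json_metadata(metadata: str) -> bool:
    """Accept an empty string or any JSON object (assumed lenient policy)."""
    if metadata == "":
        return True
    try:
        return isinstance(json.loads(metadata), dict)
    except ValueError:
        return False


print(is_valid_json_value('{"a": [1, 2]}'), is_valid_json_value("NaN"))
print(is_acceptable_json_metadata(""), is_acceptable_json_metadata("{}"))
```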

UUID#

Extension name: arrow.uuid.
The storage type of the extension is FixedSizeBinary with a length of 16 bytes.

Note

A specific UUID version is not required or guaranteed. This extension represents UUIDs as FixedSizeBinary(16) with big-endian notation and does not interpret the bytes in any way.
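
As a small illustration of the 16-byte big-endian representation, Python's standard uuid module exposes exactly this layout through the bytes attribute (as opposed to bytes_le, which stores the first three fields little-endian). The UUID value below is arbitrary and chosen only for readability.

```python
import uuid

# An arbitrary UUID used only for illustration.
value = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")

# .bytes is the 16-byte big-endian form, suitable for FixedSizeBinary(16) storage.
storage = value.bytes
print(len(storage), storage.hex())

# Reading the storage back does not depend on the UUID version.
restored = uuid.UUID(bytes=storage)
print(restored == value)
```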

Opaque#

Opaque represents a type that an Arrow-based system received from an external (often non-Arrow) system, but that it cannot interpret. In this case, it can pass on Opaque to its clients to at least show that a field exists and preserve metadata about the type from the other system.

Extension parameters:

Extension name: arrow.opaque.
The storage type of this extension is any type. If there is no underlying data, the storage type should be Null.
Extension type parameters:
- type_name = the name of the unknown type in the external system.
- vendor_name = the name of the external system.
Description of the serialization:
A valid JSON object containing the parameters as fields. In the future, additional fields may be added, but all fields current and future are never required to interpret the array.
Developers should not attempt to enable public semantic interoperability of Opaque by canonicalizing specific values of these parameters.
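
The serialization described above can be sketched in Python with the standard library alone. The helper names opaque_metadata and describe_opaque are hypothetical. Because no field is required to interpret the array, the reader below treats every field as optional and uses the values only to build a message for a human reader.

```python
import json


def opaque_metadata(type_name: str, vendor_name: str) -> str:
    """Serialize the Opaque parameters as a compact JSON object."""
    return json.dumps(
        {"type_name": type_name, "vendor_name": vendor_name},
        separators=(",", ":"),
    )


def describe_opaque(serialized: str) -> str:
    """Build a human-readable note; unknown or missing fields are tolerated."""
    fields = json.loads(serialized)
    type_name = fields.get("type_name", "<unknown type>")
    vendor_name = fields.get("vendor_name", "<unknown system>")
    return f"unsupported type {type_name!r} from {vendor_name!r}"


serialized = opaque_metadata("geometry", "PostGIS")
print(serialized)
print(describe_opaque(serialized))
```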

Rationale#

Interfacing with non-Arrow systems requires a way to handle data that doesn’t have an equivalent Arrow type. In this case, use the Opaque type, which explicitly represents an unsupported field. Other solutions are inadequate:

- Raising an error means even one unsupported field makes all operations impossible, even if (for instance) the user is just trying to view a schema.
- Dropping unsupported columns misleads the user as to the actual schema.
- An extension type may not exist for the unsupported type.
- Generating an extension type on the fly would falsely imply support.

Applications should not make conventions around vendor_name and type_name. These parameters are meant for human end users to understand what type wasn’t supported. Applications may try to interpret these fields, but must be prepared for breakage (e.g., when the type becomes supported with a custom extension type later on). Similarly, Opaque is not a generic container for file formats. Considerations such as MIME types are irrelevant. In both of these cases, create a custom extension type instead.

Examples:

A Flight SQL service that supports connecting external databases may encounter columns with unsupported types in external tables. In this case, it can use the Opaque[Null] type to at least report that a column exists with a particular name and type name. This lets clients know that a column exists, but is not supported. Null is used as the storage type here because only schemas are involved.
An example of the extension metadata would be:
```
{"type_name":"varray","vendor_name":"Oracle"}
```
The ADBC PostgreSQL driver gets results as a series of length-prefixed byte fields. But the driver will not always know how to parse the bytes, as there may be extensions (e.g. PostGIS). It can use Opaque[Binary] to still return those bytes to the application, which may be able to parse the data itself. Opaque differentiates the column from an actual binary column and makes it clear that the value is directly from PostgreSQL. (A custom extension type is preferred, but there will always be extensions that the driver does not know about.)
An example of the extension metadata would be:
```
{"type_name":"geometry","vendor_name":"PostGIS"}
```
The ADBC PostgreSQL driver may also know how to parse the bytes, but not know the intended semantics. For example, composite types can add new semantics to existing types, somewhat like Arrow extension types. The driver would be able to parse the underlying bytes in this case, but would still use the Opaque type.
Consider the example in the PostgreSQL documentation of a complex type. Mapping the type to a plain Arrow struct type would lose meaning, just like how an Arrow system deciding to treat all extension types by dropping the extension metadata would be undesirable. Instead, the driver can use Opaque[Struct] to pass on the composite type info. (It would be wrong to try to map this to an Arrow-defined complex type: it does not know the proper semantics of a user-defined type, which cannot and should not be hardcoded into the driver in the first place.)
An example of the extension metadata would be:
```
{"type_name":"database_name.schema_name.complex","vendor_name":"PostgreSQL"}
```
The JDBC adapter in the Arrow Java libraries converts JDBC result sets into Arrow arrays, and can get Arrow schemas from result sets. JDBC, however, allows drivers to return arbitrary Java objects.
The driver can use Opaque[Null] as a placeholder during schema conversion, only erroring if the application tries to fetch the actual data. That way, clients can at least introspect result schemas to decide whether they can proceed to fetch the data, or only query certain columns.
An example of the extension metadata would be:
```
{"type_name":"OTHER","vendor_name":"JDBC driver name"}
```

8-bit Boolean#

Bool8 represents a boolean value using 1 byte (8 bits) to store each value instead of only 1 bit as in the original Arrow Boolean type. Although less compact than the original representation, Bool8 may have better zero-copy compatibility with various systems that also store booleans using 1 byte.

Extension name: arrow.bool8.
The storage type of this extension is Int8 where:
- false is denoted by the value 0.
- true can be specified using any non-zero value. Preferably 1.
Extension type parameters:
This type does not have any parameters.
Description of the serialization:
Metadata is an empty string.
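
The interpretation rule for the Int8 storage can be sketched in Python as follows. The sketch models the storage as a plain array of signed bytes from the standard library rather than an Arrow array, and it ignores null handling. The function names are hypothetical.

```python
from array import array


def bool8_to_bools(storage) -> list:
    """Interpret Int8 storage values: 0 is false, any non-zero value is true."""
    return [value != 0 for value in storage]


def bools_to_bool8(values) -> array:
    """Write booleans using the preferred encoding: 1 for true, 0 for false."""
    return array("b", [1 if value else 0 for value in values])


# "b" is a signed 8-bit integer, mirroring the Int8 storage type.
storage = array("b", [0, 1, -1, 42])
decoded = bool8_to_bools(storage)
print(decoded)
print(list(bools_to_bool8(decoded)))
```

Note that a round trip through these two functions normalizes every true value to 1, so the original non-zero bytes are not preserved.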

Parquet Variant#

Variant represents a value that may be one of:

- Primitive: a type and corresponding value (e.g. INT, STRING)
- Array: An ordered list of Variant values
- Object: An unordered collection of string/Variant pairs (i.e. key/value pairs). An object may not contain duplicate keys.

Particularly, this provides a way to represent semi-structured data which is stored as a Parquet Variant value within Arrow columns in a lossless fashion. This also provides the ability to represent shredded variant values. The canonical extension type allows systems to pass Variant encoded data around without special handling unless they want to directly interact with the encoded variant data. See the Parquet format specification for details on what the actual binary values look like.

Extension name: arrow.parquet.variant.
The storage type of this extension is a Struct that obeys the following rules:
- A non-nullable field named metadata which is of type Binary, LargeBinary, or BinaryView.
- At least one (or both) of the following:
  - A field named value which is of type Binary, LargeBinary, or BinaryView. (Unshredded variants consist of just the metadata and value fields.)
  - A field named typed_value which can be one of the types listed under Primitive Type Mappings, or a List, LargeList, ListView or Struct.
    - If the typed_value field is a List, LargeList or ListView, its elements must be non-nullable and must be a Struct consisting of at least one (or both) of the following:
      - A field named value which is of type Binary, LargeBinary, or BinaryView.
      - A field named typed_value which follows the rules outlined above (this allows for arbitrarily nested data).
    - If the typed_value field is a Struct, then its fields must be non-nullable, representing the fields being shredded from the objects, and must be a Struct consisting of at least one (or both) of the following:
      - A field named value which is of type Binary, LargeBinary, or BinaryView.
      - A field named typed_value which follows the rules outlined above (this allows for arbitrarily nested data).
Extension type parameters:
This type does not have any parameters.
Description of the serialization:
Extension metadata is an empty string.

Note

It is also permissible for the metadata field to be dictionary-encoded with a preferred (but not required) index type of int8, or run-end-encoded with a preferred (but not required) runs type of int8.

Note

The fields may be in any order, and thus must be accessed by name, not by position. The field names are case sensitive.

Primitive Type Mappings#

Arrow Primitive Type	Variant Primitive Type
Null	Null
Boolean	Boolean (true/false)
Int8	Int8
Uint8	Int16
Int16	Int16
Uint16	Int32
Int32	Int32
Uint32	Int64
Int64	Int64
Float	Float
Double	Double
Decimal32	decimal4
Decimal64	decimal8
Decimal128	decimal16
Date32	Date
Time64	TimeNTZ
Timestamp(us, UTC)	Timestamp (micro)
Timestamp(us)	TimestampNTZ (micro)
Timestamp(ns, UTC)	Timestamp (nano)
Timestamp(ns)	TimestampNTZ (nano)
Binary	Binary
LargeBinary	Binary
BinaryView	Binary
String	String
LargeString	String
StringView	String
UUID extension type	UUID

Community Extension Types#

In addition to the canonical extension types listed above, there exist Arrow extension types that have been established as standards within specific domain areas. These have not been officially designated as canonical through a discussion and vote on the Arrow development mailing list but are well known within subcommunities of Arrow developers.

GeoArrow#

GeoArrow defines a collection of Arrow extension types for representing vector geometries. It is well known within the Arrow geospatial subcommunity. The GeoArrow specification is not yet finalized.
