Data Types and In-Memory Data Model#
Apache Arrow defines columnar array data structures by composing type metadata with memory buffers, like the ones explained in the documentation on Memory and IO. These data structures are exposed in Python through a series of interrelated classes:
Type Metadata: Instances of pyarrow.DataType, which describe the type of an array and govern how its values are interpreted
Schemas: Instances of pyarrow.Schema, which describe a named collection of types. These can be thought of as the column types in a table-like object.
Arrays: Instances of pyarrow.Array, which are atomic, contiguous columnar data structures composed from Arrow Buffer objects
Record Batches: Instances of pyarrow.RecordBatch, which are a collection of Array objects with a particular Schema
Tables: Instances of pyarrow.Table, a logical table data structure in which each column consists of one or more pyarrow.Array objects of the same type.
We will examine these in the sections below in a series of examples.
Type Metadata#
Apache Arrow defines language-agnostic column-oriented data structures for array data. These include:
Fixed-length primitive types: numbers, booleans, date and times, fixed size binary, decimals, and other values that fit into a given number of bits
Variable-length primitive types: binary, string
Nested types: list, map, struct, and union
Dictionary type: An encoded categorical type (more on this later)
Each data type in Arrow has a corresponding factory function for creating an instance of that type object in Python:
In [1]: import pyarrow as pa

In [2]: t1 = pa.int32()

In [3]: t2 = pa.string()

In [4]: t3 = pa.binary()

In [5]: t4 = pa.binary(10)

In [6]: t5 = pa.timestamp('ms')

In [7]: t1
Out[7]: DataType(int32)

In [8]: print(t1)
int32

In [9]: print(t4)
fixed_size_binary[10]

In [10]: print(t5)
timestamp[ms]
Note
Different data types might use a given physical storage. For example, int64, float64, and timestamp[ms] all occupy 64 bits per value.
These objects are metadata; they are used for describing the data in arrays, schemas, and record batches. In Python, they can be used in functions where the input data (e.g. Python objects) may be coerced to more than one Arrow type.
The Field type is a type plus a name and optional user-defined metadata:
In [11]: f0 = pa.field('int32_field', t1)

In [12]: f0
Out[12]: pyarrow.Field<int32_field: int32>

In [13]: f0.name
Out[13]: 'int32_field'

In [14]: f0.type
Out[14]: DataType(int32)
Arrow supports nested value types like list, map, struct, and union. When creating these, you must pass types or fields to indicate the data types of the types’ children. For example, we can define a list of int32 values with:
In [15]: t6 = pa.list_(t1)

In [16]: t6
Out[16]: ListType(list<item: int32>)
A struct is a collection of named fields:
In [17]: fields = [
   ....:     pa.field('s0', t1),
   ....:     pa.field('s1', t2),
   ....:     pa.field('s2', t4),
   ....:     pa.field('s3', t6),
   ....: ]

In [18]: t7 = pa.struct(fields)

In [19]: print(t7)
struct<s0: int32, s1: string, s2: fixed_size_binary[10], s3: list<item: int32>>
For convenience, you can pass (name, type) tuples directly instead of Field instances:
In [20]: t8 = pa.struct([('s0', t1), ('s1', t2), ('s2', t4), ('s3', t6)])

In [21]: print(t8)
struct<s0: int32, s1: string, s2: fixed_size_binary[10], s3: list<item: int32>>

In [22]: t8 == t7
Out[22]: True
See Data Types API for a full listing of data type functions.
Schemas#
The Schema type is similar to the struct array type; it defines the column names and types in a record batch or table data structure. The pyarrow.schema() factory function makes new Schema objects in Python:
In [23]: my_schema = pa.schema([('field0', t1),
   ....:                        ('field1', t2),
   ....:                        ('field2', t4),
   ....:                        ('field3', t6)])

In [24]: my_schema
Out[24]:
field0: int32
field1: string
field2: fixed_size_binary[10]
field3: list<item: int32>
  child 0, item: int32
In some applications, you may not create schemas directly, only using the ones that are embedded in IPC messages.
Schemas are immutable, which means you can’t update an existing schema, but you can create a new one with updated values using Schema.set().
In [25]: updated_field = pa.field('field0_new', pa.int64())

In [26]: my_schema2 = my_schema.set(0, updated_field)

In [27]: my_schema2
Out[27]:
field0_new: int64
field1: string
field2: fixed_size_binary[10]
field3: list<item: int32>
  child 0, item: int32
Arrays#
For each data type, there is an accompanying array data structure for holding memory buffers that define a single contiguous chunk of columnar array data. When you are using PyArrow, this data may come from IPC tools, though it can also be created from various types of Python sequences (lists, NumPy arrays, pandas data).
A simple way to create arrays is with pyarrow.array, which is similar to the numpy.array function. By default PyArrow will infer the data type for you:
In [28]: arr = pa.array([1, 2, None, 3])

In [29]: arr
Out[29]:
<pyarrow.lib.Int64Array object at 0x7fe0310679a0>
[1, 2, null, 3]
But you may also pass a specific data type to override type inference:
In [30]: pa.array([1, 2], type=pa.uint16())
Out[30]:
<pyarrow.lib.UInt16Array object at 0x7fe031067fa0>
[1, 2]
The array’s type attribute is the corresponding piece of type metadata:
In [31]: arr.type
Out[31]: DataType(int64)
Each in-memory array has a known length and null count (which will be 0 if there are no null values):
In [32]: len(arr)
Out[32]: 4

In [33]: arr.null_count
Out[33]: 1
Scalar values can be selected with normal indexing. pyarrow.array converts None values to Arrow nulls; we return the special pyarrow.NA value for nulls:
In [34]: arr[0]
Out[34]: <pyarrow.Int64Scalar: 1>

In [35]: arr[2]
Out[35]: <pyarrow.Int64Scalar: None>
Arrow data is immutable, so values can be selected but not assigned.
Arrays can be sliced without copying:
In [36]: arr[1:3]
Out[36]:
<pyarrow.lib.Int64Array object at 0x7fe030fc88e0>
[2, null]
None values and NAN handling#
As mentioned in the above section, the Python object None is always converted to an Arrow null element on conversion to pyarrow.Array. The float NaN value, which is represented either by the Python object float('nan') or by numpy.nan, is normally converted to a valid float value during the conversion. If an integer input containing np.nan is supplied to pyarrow.array, a ValueError is raised.
For better compatibility with pandas, we support interpreting NaN values as null elements. This behavior is enabled automatically in all from_pandas functions and can be enabled in the other conversion functions by passing from_pandas=True as a function parameter.
List arrays#
pyarrow.array is able to infer the type of simple nested data structures like lists:
In [37]: nested_arr = pa.array([[], None, [1, 2], [None, 1]])

In [38]: print(nested_arr.type)
list<item: int64>
ListView arrays#
pyarrow.array can create an alternate list type called ListView:
In [39]: nested_arr = pa.array([[], None, [1, 2], [None, 1]], type=pa.list_view(pa.int64()))

In [40]: print(nested_arr.type)
list_view<item: int64>
ListView arrays have a different set of buffers than List arrays. The ListView array has both an offsets and a sizes buffer, while a List array only has an offsets buffer. This allows ListView arrays to specify out-of-order offsets:
In [41]: values = [1, 2, 3, 4, 5, 6]

In [42]: offsets = [4, 2, 0]

In [43]: sizes = [2, 2, 2]

In [44]: arr = pa.ListViewArray.from_arrays(offsets, sizes, values)

In [45]: arr
Out[45]:
<pyarrow.lib.ListViewArray object at 0x7fe030fc96c0>
[[5, 6], [3, 4], [1, 2]]
See the format specification for more details on ListView Layout.
Struct arrays#
pyarrow.array is able to infer the schema of a struct type from arrays of dictionaries:
In [46]: pa.array([{'x': 1, 'y': True}, {'z': 3.4, 'x': 4}])
Out[46]:
<pyarrow.lib.StructArray object at 0x7fe030fc9900>
-- is_valid: all not null
-- child 0 type: int64
  [1, 4]
-- child 1 type: bool
  [true, null]
-- child 2 type: double
  [null, 3.4]
Struct arrays can be initialized from a sequence of Python dicts or tuples. For tuples, you must explicitly pass the type:
In [47]: ty = pa.struct([('x', pa.int8()),
   ....:                 ('y', pa.bool_())])

In [48]: pa.array([{'x': 1, 'y': True}, {'x': 2, 'y': False}], type=ty)
Out[48]:
<pyarrow.lib.StructArray object at 0x7fe030fc9f60>
-- is_valid: all not null
-- child 0 type: int8
  [1, 2]
-- child 1 type: bool
  [true, false]

In [49]: pa.array([(3, True), (4, False)], type=ty)
Out[49]:
<pyarrow.lib.StructArray object at 0x7fe030fc9ea0>
-- is_valid: all not null
-- child 0 type: int8
  [3, 4]
-- child 1 type: bool
  [true, false]
When initializing a struct array, nulls are allowed both at the struct level and at the individual field level. If initializing from a sequence of Python dicts, a missing dict key is handled as a null value:
In [50]: pa.array([{'x': 1}, None, {'y': None}], type=ty)
Out[50]:
<pyarrow.lib.StructArray object at 0x7fe030fca140>
-- is_valid:
  [true, false, true]
-- child 0 type: int8
  [1, 0, null]
-- child 1 type: bool
  [null, false, null]
You can also construct a struct array from existing arrays for each of the struct’s components. In this case, data storage will be shared with the individual arrays, and no copy is involved:
In [51]: xs = pa.array([5, 6, 7], type=pa.int16())

In [52]: ys = pa.array([False, True, True])

In [53]: arr = pa.StructArray.from_arrays((xs, ys), names=('x', 'y'))

In [54]: arr.type
Out[54]: StructType(struct<x: int16, y: bool>)

In [55]: arr
Out[55]:
<pyarrow.lib.StructArray object at 0x7fe030fcab60>
-- is_valid: all not null
-- child 0 type: int16
  [5, 6, 7]
-- child 1 type: bool
  [false, true, true]
Map arrays#
Map arrays can be constructed from lists of lists of tuples (key-item pairs), but only if the type is explicitly passed into array():
In [56]: data = [[('x', 1), ('y', 0)], [('a', 2), ('b', 45)]]

In [57]: ty = pa.map_(pa.string(), pa.int64())

In [58]: pa.array(data, type=ty)
Out[58]:
<pyarrow.lib.MapArray object at 0x7fe030fcb040>
[
  keys: ["x", "y"]  values: [1, 0],
  keys: ["a", "b"]  values: [2, 45]
]
MapArrays can also be constructed from offset, key, and item arrays. Offsets represent the starting position of each map. Note that the MapArray.keys and MapArray.items properties give the flattened keys and items. To keep the keys and items associated with their rows, use the ListArray.from_arrays() constructor with the MapArray.offsets property.
In [59]: arr = pa.MapArray.from_arrays([0, 2, 3], ['x', 'y', 'z'], [4, 5, 6])

In [60]: arr.keys
Out[60]:
<pyarrow.lib.StringArray object at 0x7fe030fcb3a0>
["x", "y", "z"]

In [61]: arr.items
Out[61]:
<pyarrow.lib.Int64Array object at 0x7fe030fcb4c0>
[4, 5, 6]

In [62]: pa.ListArray.from_arrays(arr.offsets, arr.keys)
Out[62]:
<pyarrow.lib.ListArray object at 0x7fe030fcb700>
[["x", "y"], ["z"]]

In [63]: pa.ListArray.from_arrays(arr.offsets, arr.items)
Out[63]:
<pyarrow.lib.ListArray object at 0x7fe030fcb640>
[[4, 5], [6]]
Union arrays#
The union type represents a nested array type where each value can be one (and only one) of a set of possible types. There are two possible storage types for union arrays: sparse and dense.
In a sparse union array, each of the child arrays has the same length as the resulting union array. The children are accompanied by an int8 “types” array that indicates, for each value, which child array it must be selected from:
In [64]: xs = pa.array([5, 6, 7])

In [65]: ys = pa.array([False, False, True])

In [66]: types = pa.array([0, 1, 1], type=pa.int8())

In [67]: union_arr = pa.UnionArray.from_sparse(types, [xs, ys])

In [68]: union_arr.type
Out[68]: SparseUnionType(sparse_union<0: int64=0, 1: bool=1>)

In [69]: union_arr
Out[69]:
<pyarrow.lib.UnionArray object at 0x7fe030fcbb20>
-- is_valid: all not null
-- type_ids:
  [0, 1, 1]
-- child 0 type: int64
  [5, 6, 7]
-- child 1 type: bool
  [false, false, true]
In a dense union array, you also pass, in addition to the int8 “types” array, an int32 “offsets” array that tells, for each value, at which offset in the selected child array it can be found:
In [70]: xs = pa.array([5, 6, 7])

In [71]: ys = pa.array([False, True])

In [72]: types = pa.array([0, 1, 1, 0, 0], type=pa.int8())

In [73]: offsets = pa.array([0, 0, 1, 1, 2], type=pa.int32())

In [74]: union_arr = pa.UnionArray.from_dense(types, offsets, [xs, ys])

In [75]: union_arr.type
Out[75]: DenseUnionType(dense_union<0: int64=0, 1: bool=1>)

In [76]: union_arr
Out[76]:
<pyarrow.lib.UnionArray object at 0x7fe030e24340>
-- is_valid: all not null
-- type_ids:
  [0, 1, 1, 0, 0]
-- value_offsets:
  [0, 0, 1, 1, 2]
-- child 0 type: int64
  [5, 6, 7]
-- child 1 type: bool
  [false, true]
Dictionary Arrays#
The Dictionary type in PyArrow is a special array type that is similar to a factor in R or a pandas.Categorical. It enables one or more record batches in a file or stream to transmit integer indices referencing a shared dictionary containing the distinct values in the logical array. This is particularly useful with strings, where it saves memory and improves performance.
The way that dictionaries are handled in the Apache Arrow format and the way they appear in C++ and Python is slightly different. We define a special DictionaryArray type with a corresponding dictionary type. Let’s consider an example:
In [77]: indices = pa.array([0, 1, 0, 1, 2, 0, None, 2])

In [78]: dictionary = pa.array(['foo', 'bar', 'baz'])

In [79]: dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)

In [80]: dict_array
Out[80]:
<pyarrow.lib.DictionaryArray object at 0x7fe030f97bc0>
-- dictionary: ["foo", "bar", "baz"]
-- indices: [0, 1, 0, 1, 2, 0, null, 2]
Here we have:
In [81]: print(dict_array.type)
dictionary<values=string, indices=int64, ordered=0>

In [82]: dict_array.indices
Out[82]:
<pyarrow.lib.Int64Array object at 0x7fe030e24a60>
[0, 1, 0, 1, 2, 0, null, 2]

In [83]: dict_array.dictionary
Out[83]:
<pyarrow.lib.StringArray object at 0x7fe030e24700>
["foo", "bar", "baz"]
When using DictionaryArray with pandas, the analogue is pandas.Categorical (more on this later):
In [84]: dict_array.to_pandas()
Out[84]:
0    foo
1    bar
2    foo
3    bar
4    baz
5    foo
6    NaN
7    baz
dtype: category
Categories (3, object): ['foo', 'bar', 'baz']
Record Batches#
A Record Batch in Apache Arrow is a collection of equal-length array instances. Let’s consider a collection of arrays:
In [85]: data = [
   ....:     pa.array([1, 2, 3, 4]),
   ....:     pa.array(['foo', 'bar', 'baz', None]),
   ....:     pa.array([True, None, False, True])
   ....: ]
A record batch can be created from this list of arrays using RecordBatch.from_arrays:
In [86]: batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1', 'f2'])

In [87]: batch.num_columns
Out[87]: 3

In [88]: batch.num_rows
Out[88]: 4

In [89]: batch.schema
Out[89]:
f0: int64
f1: string
f2: bool

In [90]: batch[1]
Out[90]:
<pyarrow.lib.StringArray object at 0x7fe030e25ba0>
["foo", "bar", "baz", null]
A record batch can be sliced without copying memory like an array:
In [91]: batch2 = batch.slice(1, 3)

In [92]: batch2[1]
Out[92]:
<pyarrow.lib.StringArray object at 0x7fe030e25d80>
["bar", "baz", null]
Tables#
The PyArrow Table type is not part of the Apache Arrow specification, but is rather a tool to help with wrangling multiple record batches and array pieces as a single logical dataset. As a relevant example, we may receive multiple small record batches in a socket stream, then need to concatenate them into contiguous memory for use in NumPy or pandas. The Table object makes this efficient without requiring additional memory copying.
Considering the record batch we created above, we can create a Table containing one or more copies of the batch using Table.from_batches:
In [93]: batches = [batch] * 5

In [94]: table = pa.Table.from_batches(batches)

In [95]: table
Out[95]:
pyarrow.Table
f0: int64
f1: string
f2: bool
----
f0: [[1,2,3,4],[1,2,3,4],...,[1,2,3,4],[1,2,3,4]]
f1: [["foo","bar","baz",null],["foo","bar","baz",null],...,["foo","bar","baz",null],["foo","bar","baz",null]]
f2: [[true,null,false,true],[true,null,false,true],...,[true,null,false,true],[true,null,false,true]]

In [96]: table.num_rows
Out[96]: 20
The table’s columns are instances of ChunkedArray, which is a container for one or more arrays of the same type.
In [97]: c = table[0]

In [98]: c
Out[98]:
<pyarrow.lib.ChunkedArray object at 0x7fe030e260e0>
[[1, 2, 3, 4], [1, 2, 3, 4], ..., [1, 2, 3, 4], [1, 2, 3, 4]]

In [99]: c.num_chunks
Out[99]: 5

In [100]: c.chunk(0)
Out[100]:
<pyarrow.lib.Int64Array object at 0x7fe030e26440>
[1, 2, 3, 4]
As you’ll see in the pandas section, we can convert these objects to contiguous NumPy arrays for use in pandas:
In [101]: c.to_pandas()
Out[101]:
0     1
1     2
2     3
3     4
4     1
5     2
6     3
7     4
8     1
9     2
10    3
11    4
12    1
13    2
14    3
15    4
16    1
17    2
18    3
19    4
Name: f0, dtype: int64
Multiple tables can also be concatenated together to form a single table using pyarrow.concat_tables, if the schemas are equal:
In [102]: tables = [table] * 2

In [103]: table_all = pa.concat_tables(tables)

In [104]: table_all.num_rows
Out[104]: 40

In [105]: c = table_all[0]

In [106]: c.num_chunks
Out[106]: 10
This is similar to Table.from_batches, but uses tables as input instead of record batches. Record batches can be made into tables, but not the other way around, so if your data is already in table form, then use pyarrow.concat_tables.
Custom Schema and Field Metadata#
Arrow supports both schema-level and field-level custom key-value metadata, allowing systems to insert their own application-defined metadata to customize behavior.
Custom metadata can be accessed at Schema.metadata for the schema level and Field.metadata for the field level.
Note that this metadata is preserved in Streaming, Serialization, and IPC processes.
To customize the schema metadata of an existing table you can use Table.replace_schema_metadata():
In [107]: table.schema.metadata  # empty

In [108]: table = table.replace_schema_metadata({"f0": "First dose"})

In [109]: table.schema.metadata
Out[109]: {b'f0': b'First dose'}
To customize the metadata of a field from the table schema you can use Field.with_metadata():
In [110]: field_f1 = table.schema.field("f1")

In [111]: field_f1.metadata  # empty

In [112]: field_f1 = field_f1.with_metadata({"f1": "Second dose"})

In [113]: field_f1.metadata
Out[113]: {b'f1': b'Second dose'}
Both options create a shallow copy of the data and do not in fact change the Schema, which is immutable. To change the metadata in the schema of the table, we created a new object when calling Table.replace_schema_metadata().
To change the metadata of a field in the schema we would need to define a new schema and cast the data to this schema:
In [114]: my_schema2 = pa.schema([
   .....:     pa.field('f0', pa.int64(), metadata={"name": "First dose"}),
   .....:     pa.field('f1', pa.string(), metadata={"name": "Second dose"}),
   .....:     pa.field('f2', pa.bool_())],
   .....:     metadata={"f2": "booster"})

In [115]: t2 = table.cast(my_schema2)

In [116]: t2.schema.field("f0").metadata
Out[116]: {b'name': b'First dose'}

In [117]: t2.schema.field("f1").metadata
Out[117]: {b'name': b'Second dose'}

In [118]: t2.schema.metadata
Out[118]: {b'f2': b'booster'}
Metadata key and value pairs are std::string objects in the C++ implementation, and so they are bytes objects (b'...') in Python.
Record Batch Readers#
Many functions in PyArrow either return or take as an argument a RecordBatchReader. It can be used like any iterable of record batches, but it also provides their common schema without having to fetch any of the batches:
>>> schema = pa.schema([('x', pa.int64())])
>>> def iter_record_batches():
...     for i in range(2):
...         yield pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], schema=schema)
>>> reader = pa.RecordBatchReader.from_batches(schema, iter_record_batches())
>>> print(reader.schema)
x: int64
>>> for batch in reader:
...     print(batch)
pyarrow.RecordBatch
x: int64
pyarrow.RecordBatch
x: int64
It can also be sent between languages using the C stream interface.
Conversion of RecordBatch to Tensor#
Each array of the RecordBatch has its own contiguous memory that is not necessarily adjacent to other arrays. A different memory structure, used in machine learning libraries, is a two-dimensional array (also called a 2-dim tensor or a matrix) which takes only one contiguous block of memory.
For this reason there is a function pyarrow.RecordBatch.to_tensor() available to efficiently convert tabular columnar data into a tensor.
The data types supported in this conversion are unsigned integer, signed integer, and float types. Currently only column-major conversion is supported.
>>> import pyarrow as pa
>>> arr1 = [1, 2, 3, 4, 5]
>>> arr2 = [10, 20, 30, 40, 50]
>>> batch = pa.RecordBatch.from_arrays(
...     [
...         pa.array(arr1, type=pa.uint16()),
...         pa.array(arr2, type=pa.int16()),
...     ], ["a", "b"]
... )
>>> batch.to_tensor()
<pyarrow.Tensor>
type: int32
shape: (5, 2)
strides: (4, 20)
>>> batch.to_tensor().to_numpy()
array([[ 1, 10],
       [ 2, 20],
       [ 3, 30],
       [ 4, 40],
       [ 5, 50]], dtype=int32)
With null_to_nan set to True one can also convert data with nulls. They will be converted to NaN:
>>> import pyarrow as pa
>>> batch = pa.record_batch(
...     [
...         pa.array([1, 2, 3, 4, None], type=pa.int32()),
...         pa.array([10, 20, 30, 40, None], type=pa.float32()),
...     ], names=["a", "b"]
... )
>>> batch.to_tensor(null_to_nan=True).to_numpy()
array([[ 1., 10.],
       [ 2., 20.],
       [ 3., 30.],
       [ 4., 40.],
       [nan, nan]])

