PyArrow Functionality #

pandas can utilize PyArrow to extend functionality and improve the performance of various APIs. This includes:

More extensive data types compared to NumPy
Missing data support (NA) for all data types
Performant IO reader integration
Facilitate interoperability with other dataframe libraries based on the Apache Arrow specification (e.g. polars, cuDF)

To use this functionality, please ensure you have installed the minimum supported PyArrow version.
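
As a minimal sketch, the installed versions can be inspected before relying on PyArrow-backed dtypes. The helper name `installed_version` is hypothetical, and because the minimum supported version is not stated in this section, the snippet only reports what is installed instead of comparing against a threshold.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of ``package``, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the versions of the two packages this page depends on.
for name in ("pandas", "pyarrow"):
    print(name, installed_version(name) or "not installed")
```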

Data Structure Integration#

A Series, Index, or the columns of a DataFrame can be directly backed by a pyarrow.ChunkedArray, which is similar to a NumPy array. To construct these from the main pandas data structures, you can pass in a string of the type followed by [pyarrow], e.g. "int64[pyarrow]", into the dtype parameter.

In [1]: ser = pd.Series([-1.5, 0.2, None], dtype="float32[pyarrow]")

In [2]: ser
Out[2]: 
0    -1.5
1     0.2
2    <NA>
dtype: float[pyarrow]

In [3]: idx = pd.Index([True, None], dtype="bool[pyarrow]")

In [4]: idx
Out[4]: Index([True, <NA>], dtype='bool[pyarrow]')

In [5]: df = pd.DataFrame([[1, 2], [3, 4]], dtype="uint64[pyarrow]")

In [6]: df
Out[6]: 
   0  1
0  1  2
1  3  4

Note

The string alias "string[pyarrow]" maps to pd.StringDtype("pyarrow"), which is not equivalent to specifying dtype=pd.ArrowDtype(pa.string()). Generally, operations on the data will behave similarly, except that pd.StringDtype("pyarrow") can return NumPy-backed nullable types while pd.ArrowDtype(pa.string()) will return ArrowDtype.

In [7]: import pyarrow as pa

In [8]: data = list("abc")

In [9]: ser_sd = pd.Series(data, dtype="string[pyarrow]")

In [10]: ser_ad = pd.Series(data, dtype=pd.ArrowDtype(pa.string()))

In [11]: ser_ad.dtype == ser_sd.dtype
Out[11]: False

In [12]: ser_sd.str.contains("a")
Out[12]: 
0     True
1    False
2    False
dtype: boolean

In [13]: ser_ad.str.contains("a")
Out[13]: 
0     True
1    False
2    False
dtype: bool[pyarrow]

For PyArrow types that accept parameters, you can pass in a PyArrow type with those parameters into ArrowDtype to use in the dtype parameter.

In [14]: import pyarrow as pa

In [15]: list_str_type = pa.list_(pa.string())

In [16]: ser = pd.Series([["hello"], ["there"]], dtype=pd.ArrowDtype(list_str_type))

In [17]: ser
Out[17]: 
0    ['hello']
1    ['there']
dtype: list<item: string>[pyarrow]

In [18]: from datetime import time

In [19]: idx = pd.Index([time(12, 30), None], dtype=pd.ArrowDtype(pa.time64("us")))

In [20]: idx
Out[20]: Index([12:30:00, <NA>], dtype='time64[us][pyarrow]')

In [21]: from decimal import Decimal

In [22]: decimal_type = pd.ArrowDtype(pa.decimal128(3, scale=2))

In [23]: data = [[Decimal("3.19"), None], [None, Decimal("-1.23")]]

In [24]: df = pd.DataFrame(data, dtype=decimal_type)

In [25]: df
Out[25]: 
      0      1
0  3.19   <NA>
1  <NA>  -1.23

If you already have a pyarrow.Array or pyarrow.ChunkedArray, you can pass it into arrays.ArrowExtensionArray to construct the associated Series, Index or DataFrame object.

In [26]: pa_array = pa.array(
   ....:     [{"1": "2"}, {"10": "20"}, None],
   ....:     type=pa.map_(pa.string(), pa.string()),
   ....: )
   ....: 

In [27]: ser = pd.Series(pd.arrays.ArrowExtensionArray(pa_array))

In [28]: ser
Out[28]: 
0      [('1', '2')]
1    [('10', '20')]
2              <NA>
dtype: map<string, string>[pyarrow]

To retrieve a pyarrow.ChunkedArray from a Series or Index, you can call the pyarrow array constructor on the Series or Index.

In [29]: ser = pd.Series([1, 2, None], dtype="uint8[pyarrow]")

In [30]: pa.array(ser)
Out[30]: 
<pyarrow.lib.UInt8Array object at 0x7f1087e61780>
[
  1,
  2,
  null
]

In [31]: idx = pd.Index(ser)

In [32]: pa.array(idx)
Out[32]: 
<pyarrow.lib.UInt8Array object at 0x7f1087e3a380>
[
  1,
  2,
  null
]

To convert a pyarrow.Table to a DataFrame, you can call the pyarrow.Table.to_pandas() method with types_mapper=pd.ArrowDtype.

In [33]: table = pa.table([pa.array([1, 2, 3], type=pa.int64())], names=["a"])

In [34]: df = table.to_pandas(types_mapper=pd.ArrowDtype)

In [35]: df
Out[35]: 
   a
0  1
1  2
2  3

In [36]: df.dtypes
Out[36]: 
a    int64[pyarrow]
dtype: object

Operations#

PyArrow data structure integration is implemented through pandas’ ExtensionArray interface; therefore, supported functionality exists where this interface is integrated within the pandas API. Additionally, this functionality is accelerated with PyArrow compute functions where available. This includes:

Numeric aggregations
Numeric arithmetic
Numeric rounding
Logical and comparison functions
String functionality
Datetime functionality

The following are just some examples of operations that are accelerated by native PyArrow compute functions.

In [37]: import pyarrow as pa

In [38]: ser = pd.Series([-1.545, 0.211, None], dtype="float32[pyarrow]")

In [39]: ser.mean()
Out[39]: -0.6669999808073044

In [40]: ser + ser
Out[40]: 
0    -3.09
1    0.422
2     <NA>
dtype: float[pyarrow]

In [41]: ser > (ser + 1)
Out[41]: 
0    False
1    False
2     <NA>
dtype: bool[pyarrow]

In [42]: ser.dropna()
Out[42]: 
0   -1.545
1    0.211
dtype: float[pyarrow]

In [43]: ser.isna()
Out[43]: 
0    False
1    False
2     True
dtype: bool

In [44]: ser.fillna(0)
Out[44]: 
0   -1.545
1    0.211
2      0.0
dtype: float[pyarrow]

In [45]: ser_str = pd.Series(["a", "b", None], dtype=pd.ArrowDtype(pa.string()))

In [46]: ser_str.str.startswith("a")
Out[46]: 
0     True
1    False
2     <NA>
dtype: bool[pyarrow]

In [47]: from datetime import datetime

In [48]: pa_type = pd.ArrowDtype(pa.timestamp("ns"))

In [49]: ser_dt = pd.Series([datetime(2022, 1, 1), None], dtype=pa_type)

In [50]: ser_dt.dt.strftime("%Y-%m")
Out[50]: 
0    2022-01
1       <NA>
dtype: string[pyarrow]

I/O Reading#

PyArrow also provides IO reading functionality that has been integrated into several pandas IO readers. These readers, such as read_csv() in the example below, provide an engine keyword that can dispatch to PyArrow to accelerate reading from an IO source.

In [51]: import io

In [52]: data = io.StringIO("""a,b,c
   ....:    1,2.5,True
   ....:    3,4.5,False
   ....: """)
   ....: 

In [53]: df = pd.read_csv(data, engine="pyarrow")

In [54]: df
Out[54]: 
   a    b      c
0  1  2.5   True
1  3  4.5  False

By default, these functions and all other IO reader functions return NumPy-backed data. These readers can return PyArrow-backed data by specifying the parameter dtype_backend="pyarrow". A reader does not necessarily need to set engine="pyarrow" to return PyArrow-backed data.

In [55]: import io

In [56]: data = io.StringIO("""a,b,c,d,e,f,g,h,i
   ....:     1,2.5,True,a,,,,,
   ....:     3,4.5,False,b,6,7.5,True,a,
   ....: """)
   ....: 

In [57]: df_pyarrow = pd.read_csv(data, dtype_backend="pyarrow")

In [58]: df_pyarrow.dtypes
Out[58]: 
a     int64[pyarrow]
b    double[pyarrow]
c      bool[pyarrow]
d    string[pyarrow]
e     int64[pyarrow]
f    double[pyarrow]
g      bool[pyarrow]
h    string[pyarrow]
i      null[pyarrow]
dtype: object

Several non-IO reader functions can also use the dtype_backend argument to return PyArrow-backed data.
