PyArrow Functionality#
pandas can utilize PyArrow to extend functionality and improve the performance of various APIs. This includes:
More extensive data types compared to NumPy
Missing data support (NA) for all data types
Performant IO reader integration
Facilitate interoperability with other dataframe libraries based on the Apache Arrow specification (e.g. polars, cuDF)
To use this functionality, please ensure you have installed the minimum supported PyArrow version.
Data Structure Integration#
A Series, Index, or the columns of a DataFrame can be directly backed by a pyarrow.ChunkedArray, which is similar to a NumPy array. To construct these from the main pandas data structures, you can pass in a string of the type followed by [pyarrow], e.g. "int64[pyarrow]", into the dtype parameter.
In [1]: ser = pd.Series([-1.5, 0.2, None], dtype="float32[pyarrow]")

In [2]: ser
Out[2]:
0    -1.5
1     0.2
2    <NA>
dtype: float[pyarrow]

In [3]: idx = pd.Index([True, None], dtype="bool[pyarrow]")

In [4]: idx
Out[4]: Index([True, <NA>], dtype='bool[pyarrow]')

In [5]: df = pd.DataFrame([[1, 2], [3, 4]], dtype="uint64[pyarrow]")

In [6]: df
Out[6]:
   0  1
0  1  2
1  3  4
Note
The string alias "string[pyarrow]" maps to pd.StringDtype("pyarrow"), which is not equivalent to specifying dtype=pd.ArrowDtype(pa.string()). Generally, operations on the data will behave similarly, except that pd.StringDtype("pyarrow") can return NumPy-backed nullable types while pd.ArrowDtype(pa.string()) will return ArrowDtype.
In [7]: import pyarrow as pa

In [8]: data = list("abc")

In [9]: ser_sd = pd.Series(data, dtype="string[pyarrow]")

In [10]: ser_ad = pd.Series(data, dtype=pd.ArrowDtype(pa.string()))

In [11]: ser_ad.dtype == ser_sd.dtype
Out[11]: False

In [12]: ser_sd.str.contains("a")
Out[12]:
0     True
1    False
2    False
dtype: boolean

In [13]: ser_ad.str.contains("a")
Out[13]:
0     True
1    False
2    False
dtype: bool[pyarrow]
For PyArrow types that accept parameters, you can pass in a PyArrow type with those parameters into ArrowDtype to use in the dtype parameter.
In [14]: import pyarrow as pa

In [15]: list_str_type = pa.list_(pa.string())

In [16]: ser = pd.Series([["hello"], ["there"]], dtype=pd.ArrowDtype(list_str_type))

In [17]: ser
Out[17]:
0    ['hello']
1    ['there']
dtype: list<item: string>[pyarrow]

In [18]: from datetime import time

In [19]: idx = pd.Index([time(12, 30), None], dtype=pd.ArrowDtype(pa.time64("us")))

In [20]: idx
Out[20]: Index([12:30:00, <NA>], dtype='time64[us][pyarrow]')

In [21]: from decimal import Decimal

In [22]: decimal_type = pd.ArrowDtype(pa.decimal128(3, scale=2))

In [23]: data = [[Decimal("3.19"), None], [None, Decimal("-1.23")]]

In [24]: df = pd.DataFrame(data, dtype=decimal_type)

In [25]: df
Out[25]:
      0      1
0  3.19   <NA>
1  <NA>  -1.23
If you already have a pyarrow.Array or pyarrow.ChunkedArray, you can pass it into arrays.ArrowExtensionArray to construct the associated Series, Index or DataFrame object.
In [26]: pa_array = pa.array(
   ....:     [{"1": "2"}, {"10": "20"}, None],
   ....:     type=pa.map_(pa.string(), pa.string()),
   ....: )
   ....:

In [27]: ser = pd.Series(pd.arrays.ArrowExtensionArray(pa_array))

In [28]: ser
Out[28]:
0      [('1', '2')]
1    [('10', '20')]
2              <NA>
dtype: map<string, string>[pyarrow]
To retrieve a pyarrow.ChunkedArray from a Series or Index, you can call the pyarrow array constructor on the Series or Index.
In [29]: ser = pd.Series([1, 2, None], dtype="uint8[pyarrow]")

In [30]: pa.array(ser)
Out[30]:
<pyarrow.lib.UInt8Array object at 0x7fe8a1ec9060>
[
  1,
  2,
  null
]

In [31]: idx = pd.Index(ser)

In [32]: pa.array(idx)
Out[32]:
<pyarrow.lib.UInt8Array object at 0x7fe8ae1628c0>
[
  1,
  2,
  null
]
To convert a pyarrow.Table to a DataFrame, you can call the pyarrow.Table.to_pandas() method with types_mapper=pd.ArrowDtype.
In [33]: table = pa.table([pa.array([1, 2, 3], type=pa.int64())], names=["a"])

In [34]: df = table.to_pandas(types_mapper=pd.ArrowDtype)

In [35]: df
Out[35]:
   a
0  1
1  2
2  3

In [36]: df.dtypes
Out[36]:
a    int64[pyarrow]
dtype: object
Operations#
PyArrow data structure integration is implemented through pandas' ExtensionArray interface; therefore, supported functionality exists where this interface is integrated within the pandas API. Additionally, this functionality is accelerated with PyArrow compute functions where available. This includes:
Numeric aggregations
Numeric arithmetic
Numeric rounding
Logical and comparison functions
String functionality
Datetime functionality
The following are just some examples of operations that are accelerated by native PyArrow compute functions.
In [37]: import pyarrow as pa

In [38]: ser = pd.Series([-1.545, 0.211, None], dtype="float32[pyarrow]")

In [39]: ser.mean()
Out[39]: -0.6669999808073044

In [40]: ser + ser
Out[40]:
0    -3.09
1    0.422
2     <NA>
dtype: float[pyarrow]

In [41]: ser > (ser + 1)
Out[41]:
0    False
1    False
2     <NA>
dtype: bool[pyarrow]

In [42]: ser.dropna()
Out[42]:
0   -1.545
1    0.211
dtype: float[pyarrow]

In [43]: ser.isna()
Out[43]:
0    False
1    False
2     True
dtype: bool

In [44]: ser.fillna(0)
Out[44]:
0   -1.545
1    0.211
2      0.0
dtype: float[pyarrow]

In [45]: ser_str = pd.Series(["a", "b", None], dtype=pd.ArrowDtype(pa.string()))

In [46]: ser_str.str.startswith("a")
Out[46]:
0     True
1    False
2     <NA>
dtype: bool[pyarrow]

In [47]: from datetime import datetime

In [48]: pa_type = pd.ArrowDtype(pa.timestamp("ns"))

In [49]: ser_dt = pd.Series([datetime(2022, 1, 1), None], dtype=pa_type)

In [50]: ser_dt.dt.strftime("%Y-%m")
Out[50]:
0    2022-01
1       <NA>
dtype: string[pyarrow]
I/O Reading#
PyArrow also provides IO reading functionality that has been integrated into several pandas IO readers. The following functions provide an engine keyword that can dispatch to PyArrow to accelerate reading from an IO source.
In [51]: import io

In [52]: data = io.StringIO("""a,b,c
   ....:    1,2.5,True
   ....:    3,4.5,False
   ....: """)
   ....:

In [53]: df = pd.read_csv(data, engine="pyarrow")

In [54]: df
Out[54]:
   a    b      c
0  1  2.5   True
1  3  4.5  False
By default, these functions and all other IO reader functions return NumPy-backed data. These readers can return PyArrow-backed data by specifying the parameter dtype_backend="pyarrow". A reader does not necessarily need to set engine="pyarrow" to return PyArrow-backed data.
In [55]: import io

In [56]: data = io.StringIO("""a,b,c,d,e,f,g,h,i
   ....:     1,2.5,True,a,,,,,
   ....:     3,4.5,False,b,6,7.5,True,a,
   ....: """)
   ....:

In [57]: df_pyarrow = pd.read_csv(data, dtype_backend="pyarrow")

In [58]: df_pyarrow.dtypes
Out[58]:
a     int64[pyarrow]
b    double[pyarrow]
c      bool[pyarrow]
d    string[pyarrow]
e     int64[pyarrow]
f    double[pyarrow]
g      bool[pyarrow]
h    string[pyarrow]
i      null[pyarrow]
dtype: object
Several non-IO reader functions can also use the dtype_backend argument to return PyArrow-backed data, including: