Pandas Integration#
To interface with pandas, PyArrow provides various conversion routines to consume pandas structures and convert back to them.
Note
While pandas uses NumPy as a backend, it has enough peculiarities (such as a different type system, and support for null values) that this is a separate topic from NumPy Integration.
To follow examples in this document, make sure to run:
In [1]: import pandas as pd

In [2]: import pyarrow as pa
DataFrames#
The equivalent to a pandas DataFrame in Arrow is a Table. Both consist of a set of named columns of equal length. While pandas only supports flat columns, the Table also provides nested columns, thus it can represent more data than a DataFrame, so a full conversion is not always possible.
Conversion from a Table to a DataFrame is done by calling pyarrow.Table.to_pandas(). The inverse is then achieved by using pyarrow.Table.from_pandas().
import pyarrow as pa
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Convert from pandas to Arrow
table = pa.Table.from_pandas(df)

# Convert back to pandas
df_new = table.to_pandas()

# Infer Arrow schema from pandas
schema = pa.Schema.from_pandas(df)
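To illustrate the nested-column point above, here is a minimal sketch (the column name is ours, not from the original example): a Table can hold a list-typed column, which pandas can only represent as a generic object-dtype column.

import pyarrow as pa

# A Table with a nested (list<int64>) column -- something a pandas
# DataFrame has no native column type for.
nested = pa.table({"values": [[1, 2], [3], None]})
print(nested.schema)                        # values: list<item: int64>

# Converting to pandas falls back to an object-dtype column.
print(nested.to_pandas()["values"].dtype)   # object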
By default pyarrow tries to preserve and restore the .index data as accurately as possible. See the section below for more about this, and how to disable this logic.
Series#
In Arrow, the most similar structure to a pandas Series is an Array. It is a vector that contains data of the same type as linear memory. You can convert a pandas Series to an Arrow Array using pyarrow.Array.from_pandas(). As Arrow Arrays are always nullable, you can supply an optional mask using the mask parameter to mark all null entries.
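For example, a small sketch of such a conversion, with an illustrative mask marking the second entry as null:

import numpy as np
import pandas as pd
import pyarrow as pa

s = pd.Series([1, 2, 3])

# True entries in the mask are treated as null in the resulting Array.
mask = np.array([False, True, False])
arr = pa.Array.from_pandas(s, mask=mask)
# arr is now an Int64Array: [1, null, 3]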
Handling pandas Indexes#
Methods like pyarrow.Table.from_pandas() have a preserve_index option which defines how to preserve (store) or not to preserve (to not store) the data in the index member of the corresponding pandas object. This data is tracked using schema-level metadata in the internal arrow::Schema object.
The default of preserve_index is None, which behaves as follows:

- RangeIndex is stored as metadata-only, not requiring any extra storage.
- Other index types are stored as one or more physical data columns in the resulting Table.
To not store the index at all, pass preserve_index=False. Since storing a RangeIndex can cause issues in some limited scenarios (such as storing multiple DataFrame objects in a Parquet file), to force all index data to be serialized in the resulting table, pass preserve_index=True.
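A short sketch of the three behaviors (the DataFrame and index name are illustrative):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1, 2, 3]}, index=pd.Index([10, 20, 30], name="idx"))

# Default (preserve_index=None): the index is preserved; a RangeIndex
# would be stored as metadata only, other index types as columns.
table_default = pa.Table.from_pandas(df)

# preserve_index=False: the index is dropped entirely.
table_no_index = pa.Table.from_pandas(df, preserve_index=False)

# preserve_index=True: the index is always stored as a physical column.
table_forced = pa.Table.from_pandas(df, preserve_index=True)

print(table_no_index.column_names)  # ['a']
print(table_forced.column_names)    # ['a', 'idx']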
Type differences#
With the current design of pandas and Arrow, it is not possible to convert all column types unmodified. One of the main issues here is that pandas has no support for nullable columns of arbitrary type. Also, datetime64 is currently fixed to nanosecond resolution. On the other side, Arrow might still be missing support for some types.
pandas -> Arrow Conversion#
| Source Type (pandas) | Destination Type (Arrow) |
|---|---|
| bool | BOOL |
| int8 | INT8 |
| int16 | INT16 |
| int32 | INT32 |
| int64 | INT64 |
| uint8 | UINT8 |
| uint16 | UINT16 |
| uint32 | UINT32 |
| uint64 | UINT64 |
| float32 | FLOAT |
| float64 | DOUBLE |
| str / unicode | STRING |
| pd.Categorical | DICTIONARY |
| pd.Timestamp | TIMESTAMP(unit=ns) |
| datetime.date | DATE |
| datetime.time | TIME64 |
Arrow -> pandas Conversion#
| Source Type (Arrow) | Destination Type (pandas) |
|---|---|
| BOOL | bool |
| BOOL with nulls | object (with values True, False, None) |
| INT8 / INT16 / INT32 / INT64 / UINT8 / UINT16 / UINT32 / UINT64 | int8 / int16 / int32 / int64 / uint8 / uint16 / uint32 / uint64 |
| INT8 / INT16 / INT32 / INT64 / UINT8 / UINT16 / UINT32 / UINT64 with nulls | float64 |
| FLOAT | float32 |
| DOUBLE | float64 |
| STRING | str |
| DICTIONARY | pandas.Categorical |
| TIMESTAMP(unit=*) | pd.Timestamp (np.datetime64[ns]) |
| DATE | object (with datetime.date objects) |
| TIME | object (with datetime.time objects) |
Categorical types#
Pandas categorical columns are converted to Arrow dictionary arrays, a special array type optimized to handle a repeated and limited number of possible values.
In [3]: df = pd.DataFrame({"cat": pd.Categorical(["a", "b", "c", "a", "b", "c"])})

In [4]: df.cat.dtype.categories
Out[4]: Index(['a', 'b', 'c'], dtype='object')

In [5]: df
Out[5]: 
  cat
0   a
1   b
2   c
3   a
4   b
5   c

In [6]: table = pa.Table.from_pandas(df)

In [7]: table
Out[7]: 
pyarrow.Table
cat: dictionary<values=string, indices=int8, ordered=0>
----
cat: [  -- dictionary:
["a","b","c"]  -- indices:
[0,1,2,0,1,2]]
We can inspect the ChunkedArray of the created table and see the same categories of the Pandas DataFrame.
In [8]: column = table[0]

In [9]: chunk = column.chunk(0)

In [10]: chunk.dictionary
Out[10]: 
<pyarrow.lib.StringArray object at 0x7fdffc8ff7c0>
[
  "a",
  "b",
  "c"
]

In [11]: chunk.indices
Out[11]: 
<pyarrow.lib.Int8Array object at 0x7fdffc8ff340>
[
  0,
  1,
  2,
  0,
  1,
  2
]
Datetime (Timestamp) types#
Pandas Timestamps use the datetime64[ns] type in Pandas and are converted to an Arrow TimestampArray.
In [12]: df = pd.DataFrame({"datetime": pd.date_range("2020-01-01T00:00:00Z", freq="h", periods=3)})

In [13]: df.dtypes
Out[13]: 
datetime    datetime64[ns, UTC]
dtype: object

In [14]: df
Out[14]: 
                   datetime
0 2020-01-01 00:00:00+00:00
1 2020-01-01 01:00:00+00:00
2 2020-01-01 02:00:00+00:00

In [15]: table = pa.Table.from_pandas(df)

In [16]: table
Out[16]: 
pyarrow.Table
datetime: timestamp[ns, tz=UTC]
----
datetime: [[2020-01-01 00:00:00.000000000Z,2020-01-01 01:00:00.000000000Z,2020-01-01 02:00:00.000000000Z]]
In this example the Pandas Timestamp is time zone aware (UTC in this case), and this information is used to create the Arrow TimestampArray.
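Continuing the session above (a minimal sketch), the time zone is carried on the Arrow type and restored on conversion back to pandas:

# The Arrow timestamp type records the original time zone.
print(table.schema.field("datetime").type.tz)  # UTC

# Converting back to pandas restores a tz-aware datetime64[ns, UTC] column.
print(table.to_pandas().dtypes)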
Date types#
While dates can be handled using the datetime64[ns] type in pandas, some systems work with object arrays of Python’s built-in datetime.date object:
In [17]: from datetime import date

In [18]: s = pd.Series([date(2018, 12, 31), None, date(2000, 1, 1)])

In [19]: s
Out[19]: 
0    2018-12-31
1          None
2    2000-01-01
dtype: object
When converting to an Arrow array, the date32 type will be used by default:
In [20]: arr = pa.array(s)

In [21]: arr.type
Out[21]: DataType(date32[day])

In [22]: arr[0]
Out[22]: <pyarrow.Date32Scalar: datetime.date(2018, 12, 31)>
To use the 64-bit date64, specify this explicitly:
In [23]: arr = pa.array(s, type='date64')

In [24]: arr.type
Out[24]: DataType(date64[ms])
When converting back with to_pandas, object arrays of datetime.date objects are returned:
In [25]: arr.to_pandas()
Out[25]: 
0    2018-12-31
1          None
2    2000-01-01
dtype: object
If you want to use NumPy’s datetime64 dtype instead, pass date_as_object=False:
In [26]: s2 = pd.Series(arr.to_pandas(date_as_object=False))

In [27]: s2.dtype
Out[27]: dtype('<M8[ms]')
Warning
As of Arrow 0.13 the parameter date_as_object is True by default. Older versions must pass date_as_object=True to obtain this behavior.
Time types#
The built-in datetime.time objects inside Pandas data structures will be converted to an Arrow time64 and Time64Array respectively.
In [28]: from datetime import time

In [29]: s = pd.Series([time(1, 1, 1), time(2, 2, 2)])

In [30]: s
Out[30]: 
0    01:01:01
1    02:02:02
dtype: object

In [31]: arr = pa.array(s)

In [32]: arr.type
Out[32]: Time64Type(time64[us])

In [33]: arr
Out[33]: 
<pyarrow.lib.Time64Array object at 0x7fdfff0a7160>
[
  01:01:01.000000,
  02:02:02.000000
]
When converting to pandas, arrays of datetime.time objects are returned:
In [34]: arr.to_pandas()
Out[34]: 
0    01:01:01
1    02:02:02
dtype: object
Nullable types#
In Arrow all data types are nullable, meaning they support storing missing values. In pandas, however, not all data types have support for missing data. Most notably, the default integer data types do not, and will be cast to float when missing values are introduced. Therefore, when an Arrow array or table gets converted to pandas, integer columns will become float when missing values are present:
>>> arr = pa.array([1, 2, None])
>>> arr
<pyarrow.lib.Int64Array object at 0x7f07d467c640>
[
  1,
  2,
  null
]
>>> arr.to_pandas()
0    1.0
1    2.0
2    NaN
dtype: float64
Pandas has experimental nullable data types (https://pandas.pydata.org/docs/user_guide/integer_na.html). Arrow supports round-trip conversion for those:
>>> df = pd.DataFrame({'a': pd.Series([1, 2, None], dtype="Int64")})
>>> df
      a
0     1
1     2
2  <NA>

>>> table = pa.table(df)
>>> table
pyarrow.Table
a: int64
----
a: [[1,2,null]]

>>> table.to_pandas()
      a
0     1
1     2
2  <NA>
>>> table.to_pandas().dtypes
a    Int64
dtype: object
This roundtrip conversion works because metadata about the original pandas DataFrame gets stored in the Arrow table. However, if you have Arrow data (or e.g. a Parquet file) not originating from a pandas DataFrame with nullable data types, the default conversion to pandas will not use those nullable dtypes.
The pyarrow.Table.to_pandas() method has a types_mapper keyword that can be used to override the default data type used for the resulting pandas DataFrame. This way, you can instruct Arrow to create a pandas DataFrame using nullable dtypes.
>>> table = pa.table({"a": [1, 2, None]})
>>> table.to_pandas()
     a
0  1.0
1  2.0
2  NaN

>>> table.to_pandas(types_mapper={pa.int64(): pd.Int64Dtype()}.get)
      a
0     1
1     2
2  <NA>
The types_mapper keyword expects a function that will return the pandas data type to use given a pyarrow data type. By using the dict.get method, we can create such a function using a dictionary.
If you want to use all currently supported nullable dtypes by pandas, thisdictionary becomes:
dtype_mapping = {
    pa.int8(): pd.Int8Dtype(),
    pa.int16(): pd.Int16Dtype(),
    pa.int32(): pd.Int32Dtype(),
    pa.int64(): pd.Int64Dtype(),
    pa.uint8(): pd.UInt8Dtype(),
    pa.uint16(): pd.UInt16Dtype(),
    pa.uint32(): pd.UInt32Dtype(),
    pa.uint64(): pd.UInt64Dtype(),
    pa.bool_(): pd.BooleanDtype(),
    pa.float32(): pd.Float32Dtype(),
    pa.float64(): pd.Float64Dtype(),
    pa.string(): pd.StringDtype(),
}

df = table.to_pandas(types_mapper=dtype_mapping.get)
When using the pandas API for reading Parquet files (pd.read_parquet(..)), this can also be achieved by passing use_nullable_dtypes:
df = pd.read_parquet(path, use_nullable_dtypes=True)
Memory Usage and Zero Copy#
When converting from Arrow data structures to pandas objects using various to_pandas methods, one must occasionally be mindful of issues related to performance and memory usage.
Since pandas’s internal data representation is generally different from the Arrow columnar format, zero copy conversions (where no memory allocation or computation is required) are only possible in certain limited cases.
In the worst case scenario, calling to_pandas will result in two versions of the data in memory, one for Arrow and one for pandas, yielding approximately twice the memory footprint. We have implemented some mitigations for this case, particularly when creating large DataFrame objects, that we describe below.
Zero Copy Series Conversions#
Zero copy conversions from Array or ChunkedArray to NumPy arrays or pandas Series are possible in certain narrow cases:
- The Arrow data is stored in an integer (signed or unsigned int8 through int64) or floating point type (float16 through float64). This includes many numeric types as well as timestamps.
- The Arrow data has no null values (since these are represented using bitmaps which are not supported by pandas).
- For ChunkedArray, the data consists of a single chunk, i.e. arr.num_chunks == 1. Multiple chunks will always require a copy because of pandas’s contiguousness requirement.
In these scenarios, to_pandas or to_numpy will be zero copy. In all other scenarios, a copy will be required.
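As a rough illustration (a sketch, not part of the original examples), Array.to_numpy() defaults to zero_copy_only=True and raises when a copy would be required:

import pyarrow as pa

# Single chunk, integer type, no nulls: zero copy is possible.
arr = pa.array([1, 2, 3])
view = arr.to_numpy()  # zero_copy_only=True by default

# With nulls present, a zero copy conversion is impossible.
arr_with_null = pa.array([1, 2, None])
try:
    arr_with_null.to_numpy()
except pa.ArrowInvalid:
    # Explicitly allow a copy instead.
    copied = arr_with_null.to_numpy(zero_copy_only=False)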
Reducing Memory Use in Table.to_pandas#
As of this writing, pandas applies a data management strategy called “consolidation” to collect like-typed DataFrame columns in two-dimensional NumPy arrays, referred to internally as “blocks”. We have gone to great effort to construct the precise “consolidated” blocks so that pandas will not perform any further allocation or copies after we hand off the data to pandas.DataFrame. The obvious downside of this consolidation strategy is that it forces a “memory doubling”.
To try to limit the potential effects of “memory doubling” during Table.to_pandas, we provide a couple of options:
- split_blocks=True: when enabled, Table.to_pandas produces one internal DataFrame “block” for each column, skipping the “consolidation” step. Note that many pandas operations will trigger consolidation anyway, but the peak memory use may be less than the worst case scenario of a full memory doubling. As a result of this option, we are able to do zero copy conversions of columns in the same cases where we can do zero copy with Array and ChunkedArray.
- self_destruct=True: this destroys the internal Arrow memory buffers in each column Table object as they are converted to the pandas-compatible representation, potentially releasing memory to the operating system as soon as a column is converted. Note that this renders the calling Table object unsafe for further use, and any further methods called will cause your Python process to crash.
Used together, the call
df = table.to_pandas(split_blocks=True, self_destruct=True)
del table  # not necessary, but a good practice
will yield significantly lower memory usage in some scenarios. Without these options, to_pandas will always double memory.
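One way to observe the effect is to watch Arrow’s allocated memory before and after the hand-off. This is only a sketch, and the actual savings depend on the data; string columns are used here because they always require a copy when converted, so their Arrow buffers can actually be released:

import pyarrow as pa

table = pa.table({"col": [str(i) for i in range(1_000_000)]})
print(pa.total_allocated_bytes())  # Arrow memory held before conversion

df = table.to_pandas(split_blocks=True, self_destruct=True)
del table  # the Table must not be used after self_destruct

print(pa.total_allocated_bytes())  # typically lower once buffers are released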
Note that self_destruct=True is not guaranteed to save memory. Since the conversion happens column by column, memory is also freed column by column. But if multiple columns share an underlying buffer, then no memory will be freed until all of those columns are converted. In particular, due to implementation details, data that comes from IPC or Flight is prone to this, as memory will be laid out as follows:
Record Batch 0: Allocation 0: array 0 chunk 0, array 1 chunk 0, ...
Record Batch 1: Allocation 1: array 0 chunk 1, array 1 chunk 1, ...
...
In this case, no memory can be freed until the entire table is converted, even with self_destruct=True.

