Dataframe Interchange Protocol#
The interchange protocol is implemented for pa.Table and pa.RecordBatch and is used to interchange data between PyArrow and other dataframe libraries that also implement the protocol. The protocol supports primitive data types and the dictionary data type. It also supports missing data and chunking, meaning that the data can be accessed in “batches” of rows.
The Python dataframe interchange protocol was designed by the Consortium for Python Data API Standards to enable data interchange between dataframe libraries in the Python ecosystem. See the protocol documentation for more about the standard.
From PyArrow to other libraries: __dataframe__() method#
The __dataframe__() method creates a new exchange object that the consumer library can take and use to construct an object of its own.
>>> import pyarrow as pa
>>> table = pa.table({"n_attendees": [100, 10, 1]})
>>> table.__dataframe__()
<pyarrow.interchange.dataframe._PyArrowDataFrame object at ...>
This method is meant to be used by the consumer library when calling the from_dataframe() function and is not meant to be called manually by the user.
From other libraries to PyArrow: from_dataframe()#
With the from_dataframe() function, we can construct a pyarrow.Table from any dataframe object that implements the __dataframe__() method of the dataframe interchange protocol.
We can, for example, take a pandas dataframe and construct a PyArrow table using the interchange protocol:
>>> import pyarrow
>>> from pyarrow.interchange import from_dataframe
>>> import pandas as pd
>>> df = pd.DataFrame({
...     "n_attendees": [100, 10, 1],
...     "country": ["Italy", "Spain", "Slovenia"],
... })
>>> df
   n_attendees   country
0          100     Italy
1           10     Spain
2            1  Slovenia
>>> from_dataframe(df)
pyarrow.Table
n_attendees: int64
country: large_string
----
n_attendees: [[100,10,1]]
country: [["Italy","Spain","Slovenia"]]
We can do the same with a polars dataframe:
>>> import polars as pl
>>> from datetime import datetime
>>> arr = [datetime(2023, 5, 20, 10, 0),
...        datetime(2023, 5, 20, 11, 0),
...        datetime(2023, 5, 20, 13, 30)]
>>> df = pl.DataFrame({
...     'Talk': ['About Polars', 'Intro into PyArrow', 'Coding in Rust'],
...     'Time': arr,
... })
>>> df
shape: (3, 2)
┌────────────────────┬─────────────────────┐
│ Talk               ┆ Time                │
│ ---                ┆ ---                 │
│ str                ┆ datetime[μs]        │
╞════════════════════╪═════════════════════╡
│ About Polars       ┆ 2023-05-20 10:00:00 │
│ Intro into PyArrow ┆ 2023-05-20 11:00:00 │
│ Coding in Rust     ┆ 2023-05-20 13:30:00 │
└────────────────────┴─────────────────────┘
>>> from_dataframe(df)
pyarrow.Table
Talk: large_string
Time: timestamp[us]
----
Talk: [["About Polars","Intro into PyArrow","Coding in Rust"]]
Time: [[2023-05-20 10:00:00.000000,2023-05-20 11:00:00.000000,2023-05-20 13:30:00.000000]]

