Python
PyArrow - Apache Arrow Python bindings
This is the documentation of the Python API of Apache Arrow.
Apache Arrow is a universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics. It contains a set of technologies that enable data systems to efficiently store, process, and move data.
See the parent documentation for additional details on the Arrow Project itself, on the Arrow format, and on the other language bindings.
The Arrow Python bindings (also named “PyArrow”) have first-class integration with NumPy, pandas, and built-in Python objects. They are based on the C++ implementation of Arrow.
Here we will detail the usage of the Python API for Arrow and the leaf libraries that add additional functionality, such as reading Apache Parquet files into Arrow structures.
- Installing PyArrow
- Getting Started
- Data Types and In-Memory Data Model
- Compute Functions
- Memory and IO Interfaces
- Streaming, Serialization, and IPC
- Filesystem Interface
- NumPy Integration
- Pandas Integration
- Dataframe Interchange Protocol
- The DLPack Protocol
- Timestamps
- Reading and Writing the Apache ORC Format
- Reading and Writing CSV files
- Feather File Format
- Reading JSON files
- Reading and Writing the Apache Parquet Format
- Obtaining pyarrow with Parquet Support
- Reading and Writing Single Files
- Finer-grained Reading and Writing
- Inspecting the Parquet File Metadata
- Data Type Handling
- Compression, Encoding, and File Compatibility
- Partitioned Datasets (Multiple Files)
- Writing to Partitioned Datasets
- Reading from Partitioned Datasets
- Using with Spark
- Multithreaded Reads
- Reading from cloud storage
- Parquet Modular Encryption (Columnar Encryption)
- Content-Defined Chunking
- Tabular Datasets
- Arrow Flight RPC
- Extending PyArrow
- PyArrow Integrations
- Environment Variables
- API Reference
- Getting Involved
- Benchmarks
- Python cookbook

