Specification for storing geospatial data in Apache Arrow
geoarrow/geoarrow
This repository contains a specification for storing geospatial data in Apache Arrow and Arrow-compatible data structures and formats.
The Apache Arrow project specifies a standardized language-independent columnar memory format. It enables shared computational libraries, zero-copy shared memory and streaming messaging, interprocess communication, and is supported by many programming languages and data libraries.
Spatial information can be represented as a collection of discrete objects using points, lines and polygons (i.e., vector data). The Simple Feature Access standard provides a widely used abstraction, defining a set of geometries: Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection. Next to a geometry, simple features can also have non-spatial attributes that describe the feature.
Geospatial data often comes in tabular format, with one or more columns with feature geometries and additional columns with feature attributes. The Arrow columnar memory model is well-suited to store both vector features and their attribute data. The GeoArrow specification defines how the vector features (geometries) can be stored in Arrow (and Arrow-compatible) data structures.
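To make the columnar idea concrete, here is a minimal sketch in plain Python of how a column of LineStrings could be flattened into an offsets buffer plus contiguous coordinate buffers. Plain lists stand in for Arrow buffers, and the helper names `pack_linestrings` and `unpack_linestring` are hypothetical. This is an illustration of the general approach only; format.md remains the normative description of the layouts.

```python
# Illustrative only: plain Python lists stand in for Arrow buffers.
# The helper names below are hypothetical and not part of any GeoArrow library.
from typing import List, Tuple

Coord = Tuple[float, float]


def pack_linestrings(lines: List[List[Coord]]):
    """Flatten nested linestrings into an offsets list plus x and y columns."""
    offsets = [0]
    xs: List[float] = []
    ys: List[float] = []
    for line in lines:
        for x, y in line:
            xs.append(x)
            ys.append(y)
        # Each offset marks where the next linestring starts in xs/ys.
        offsets.append(len(xs))
    return offsets, xs, ys


def unpack_linestring(offsets, xs, ys, i: int) -> List[Coord]:
    """Recover the i-th linestring by slicing the flat coordinate columns."""
    start, end = offsets[i], offsets[i + 1]
    return list(zip(xs[start:end], ys[start:end]))


lines = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(2.0, 2.0), (3.0, 3.0), (4.0, 5.0)],
]
offsets, xs, ys = pack_linestrings(lines)
second = unpack_linestring(offsets, xs, ys, 1)
```

Because every geometry is addressed through the offsets, reading the i-th feature is a slice rather than a parse, which is what lets columnar tools share such buffers without copying or re-encoding them.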
This repository contains the specifications for:
- The memory layout for storing geometries in an Arrow array (format.md)
- The Arrow extension type definitions that ensure type-level metadata (e.g., CRS) is propagated when used in Arrow implementations (extension-types.md)
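As a rough sketch of how type-level metadata can travel with a column, Arrow implementations carry extension types through two reserved field-metadata keys, `ARROW:extension:name` and `ARROW:extension:metadata`. The example below builds such a metadata mapping with the standard library only. The helper names are hypothetical, and the extension name `geoarrow.point`, the `crs` key, and the `"EPSG:4326"` value are placeholders assumed for illustration; consult extension-types.md for the names and metadata fields the specification actually defines.

```python
# A sketch under stated assumptions: the extension name, the "crs" key, and
# the CRS value are illustrative placeholders. See extension-types.md for the
# authoritative definitions. Helper names are hypothetical.
import json
from typing import Any, Dict, Optional


def make_extension_field_metadata(
    extension_name: str, crs: Optional[Any] = None
) -> Dict[str, str]:
    """Build field metadata using Arrow's reserved extension-type keys."""
    params = {} if crs is None else {"crs": crs}
    return {
        "ARROW:extension:name": extension_name,
        # The extension metadata is serialized as a JSON string.
        "ARROW:extension:metadata": json.dumps(params),
    }


def read_crs(field_metadata: Dict[str, str]) -> Optional[Any]:
    """Return the CRS stored in the serialized extension metadata, if any."""
    raw = field_metadata.get("ARROW:extension:metadata", "{}")
    return json.loads(raw).get("crs")


meta = make_extension_field_metadata("geoarrow.point", crs="EPSG:4326")
crs = read_crs(meta)
```

Keeping the CRS at the field level means it stays attached to the geometry column whether the data lives in memory, in a file, or in an IPC stream.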
Defining a standard and efficient way to store geospatial data in the Arrow memory layout enables interoperability between different tools and ensures geospatial tools can leverage the growing Apache Arrow ecosystem:
- Efficient, columnar file formats: leveraging the performant and compact storage of Apache Parquet as a vector data format in geospatial tools using GeoParquet
- Accelerated between-process geospatial data exchange using the Apache Arrow IPC message format and Apache Arrow Flight
- Zero-copy in-process geospatial data transport using the Apache Arrow C Data Interface (e.g., GDAL)
- Shared libraries for geospatial data type representation and computation for query engines that support columnar data formats (e.g., Velox, DuckDB, and Acero)
The GeoParquet specification originally started in this repo, but was moved out into its own repo, leaving this repo to focus on the Arrow-specific specifications (Arrow layout and extension type metadata). Whereas GeoParquet is a file-level metadata specification, GeoArrow is a field-level metadata and memory layout specification that applies in memory (e.g., an Arrow array), on disk (e.g., using Parquet readers/writers provided by an Arrow implementation), and over the wire (e.g., using the Arrow IPC format).
- geoarrow-c: geospatial type system and generic coordinate-shuffling library written in C with bindings in C++, R, and Python
- geoarrow-rs: Rust implementation of the GeoArrow specification and bindings to GeoRust algorithms for efficient spatial operations on GeoArrow memory. See also:
  - Python bindings to geoarrow-rs
  - geoarrow-wasm: JavaScript (WebAssembly) bindings to geoarrow-rs
- geoarrow-python: Python bindings to geoarrow-c that provide integrations with libraries like pyarrow, pandas, and geopandas.
- geoarrow-r: R bindings to geoarrow-c that provide integrations with libraries like sf and Arrow for geospatial data handling.
- geoarrow-js: Pure TypeScript implementation of GeoArrow, on top of the Arrow JavaScript implementation.
- Lonboard: fast, interactive geospatial vector data visualization in Jupyter, building on top of GeoArrow.