Efficient Pandas representation for nested associated datasets.
lincc-frameworks/nested-pandas
An extension of pandas for efficient representation of nested associated datasets.
Nested-Pandas extends the pandas package with tooling and support for nested dataframes packed into values of top-level dataframe columns. PyArrow is used internally to aid in scalability and performance.
Nested-Pandas allows data like this:
To instead be represented like this:
Where the nested data is represented as nested dataframes:
    # Each row of "object_nf" now has its own sub-dataframe of matched rows from "source_df"
    object_nf.loc[0]["nested_sources"]
Allowing powerful and straightforward operations, like:
    # Compute the mean flux for each row of "object_nf"
    import numpy as np
    object_nf.reduce(np.mean, "nested_sources.flux")
Nested-Pandas is motivated by time-domain astronomy use cases, where we typically see two levels of information: information about astronomical objects, and an associated set of N measurements of those objects. Nested-Pandas offers a performant and memory-efficient package for working with these types of datasets.
Its core advantages are:
- hierarchical column access
- efficient packing of nested information into inputs to custom user functions
- avoiding costly groupby operations
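For contrast, here is a minimal sketch of the flat, two-table workflow in plain pandas that the nested representation is meant to replace. It is not part of Nested-Pandas itself: the tables `object_df` and `source_df`, their columns, and the values are all hypothetical toy data. Every per-object statistic here needs a `groupby` over the full source table, followed by a join back onto the object table.

```python
import pandas as pd

# Hypothetical top-level table: one row per astronomical object.
object_df = pd.DataFrame(
    {"ra": [10.0, 20.0, 30.0], "dec": [-5.0, 0.0, 5.0]},
    index=pd.Index([0, 1, 2], name="object_id"),
)

# Hypothetical flat measurement table: N rows per object, linked by "object_id".
source_df = pd.DataFrame(
    {
        "object_id": [0, 0, 1, 1, 1, 2],
        "time": [1.0, 2.0, 1.0, 2.0, 3.0, 1.0],
        "flux": [1.0, 3.0, 2.0, 4.0, 6.0, 5.0],
    }
)

# Flat approach: group the whole source table by object, then aggregate.
mean_flux = source_df.groupby("object_id")["flux"].mean().rename("mean_flux")

# Join the per-object result back onto the object table.
result = object_df.join(mean_flux)
print(result)
```

With the nested representation, the measurements already sit alongside each object row, so the same statistic is expressed as the single `reduce` call shown earlier in this README rather than as a group-and-join step.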
This is a LINCC Frameworks project - find more information about LINCC Frameworks here.
This project is supported by Schmidt Sciences.