

DataTree implements a simple file-system-based storage of Python objects in a way that facilitates quick and simple programmatic access (i.e. interactive). Functionally, the DataTree builds a dot representation (attribute access on objects) based on files and directories. Directories are 'Repos' (repositories) and files are mapped onto a DataTree StorageInterface based on the file's extension (e.g. pickle -> '.pkl').
The package comes with built-in support for general objects (pickle) and Pandas objects stored in HDF. The library is Jupyter notebook-aware and will use HTML representations where sensible.
- Interactive - Repositories and their objects are represented as properties nested in objects
  - Enables tab completion, monkey-patched doc-strings with metadata, and rich HTML representations
- Maintainable - no special internal data or database systems, no lock-in.
- DataTree is currently self-contained in a single module file - drop it in and start using
- Flexible - Primary data storage logic is generalized in the StorageInterface class.
  - Adding features or an interface for your project only requires extending this class (see the sketch after this list)
- Metadata - Every object stores basic metadata (time of operation, object type, etc.) which is made searchable. More targeted StorageInterfaces can extract more significant metadata related to the type.
  - For instance, the HDF interface for Pandas objects will store the columns and a sample of the index
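
To give a feel for what extending the StorageInterface involves, here is a minimal sketch of a custom plain-text storage type. The `extension` attribute and the `save`/`load` method signatures are illustrative assumptions rather than the library's documented API - consult the `StorageInterface` class in the module for the actual hooks to override.

```python
import interactive_data_tree as idt

# A minimal sketch of a custom storage type. The attribute and method
# names below (extension, save, load) are illustrative assumptions;
# check the StorageInterface class in the module for the actual hooks.
class TextStorageInterface(idt.StorageInterface):
    extension = 'txt'  # assumed: maps '.txt' files to this interface

    def save(self, obj, path):
        # Persist the object as plain UTF-8 text
        with open(path, 'w') as f:
            f.write(str(obj))

    def load(self, path):
        # Read the stored text back in
        with open(path, 'r') as f:
            return f.read()
```

Once such an interface is registered with the tree (the registration mechanism is also interface-specific), objects saved through it would appear as ordinary `.txt` files with metadata stored alongside them.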
Iterative tasks in data analysis often require more than one dataset. Furthermore, the data may come from a variety of locations - the web, a database, unstructured data, etc. - and may not be well represented by a traditional table. Thus, practitioners are left trying to manage the source data (and their source systems) as well as any intermediate and/or output datasets. It's a difficult and time-consuming task.
The ultimate focus of this project is to make the management of many varied datasets simple, maintainable, and portable. The only expectation for use with DataTree is that the data can be represented in a way that can be stored on the local filesystem. For standard datasets, this likely means storing the data itself in the DataTree. However, new interfaces can be implemented that simply store the information required to access a remote system (e.g. store a JSON file with connection information and a SQL query - the storage interface can then lazily retrieve the data on load, as sketched below).
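
As a concrete illustration of that lazy remote-access pattern, the sketch below persists only a connection string and a query, deferring the actual fetch to load time. It reuses the hypothetical subclassing API from the sketch above; `sqlalchemy` plus `pandas.read_sql` is one possible retrieval mechanism, and none of this is a built-in DataTree interface.

```python
import json

import pandas as pd
import sqlalchemy
import interactive_data_tree as idt

# Hypothetical sketch of a lazy remote interface: only the connection
# info and query are stored on disk; the data is fetched at load time.
class SQLQueryStorageInterface(idt.StorageInterface):
    extension = 'sqljson'  # assumed extension-to-interface mapping

    def save(self, obj, path):
        # obj is expected to be a dict like
        # {'conn': 'postgresql://...', 'query': 'SELECT ...'}
        with open(path, 'w') as f:
            json.dump(obj, f)

    def load(self, path):
        with open(path, 'r') as f:
            spec = json.load(f)
        # Retrieve the data lazily, only when the object is loaded
        engine = sqlalchemy.create_engine(spec['conn'])
        return pd.read_sql(spec['query'], engine)
```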
DataTree manages the creation of directories and subsequent object files. Each directory is referred to as a 'repo', and the generic object files in these repos are mapped to DataTree StorageTypes. In the common case, objects are pickle-serialized Python objects, but storage types have few requirements and are easily extended. An object name may have several different types stored, but namespaces across types won't collide.
Every object has metadata stored alongside it in a JSON file. Each storage type can choose to use the default metadata storage, amend the default storage by overriding the write procedure, or implement an entirely different metadata storage altogether.
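
Because the default metadata is plain JSON sitting next to the object file, it can be inspected with nothing but the standard library. The path below follows the file-naming convention shown in the quickstart that follows; the exact fields stored vary by storage type.

```python
import json
import os

# The default metadata is a plain JSON file next to the object file;
# this path follows the naming convention shown in the quickstart below
md_path = os.path.expanduser('~/.idt_root.repo/lvl1.repo/test_foobar.pkl.mdjson')
with open(md_path) as f:
    metadata = json.load(f)

# Print the whole structure rather than assume exact field names -
# the fields stored (operation time, object type, etc.) vary by storage type
print(metadata)
```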
```python
import interactive_data_tree as idt

# Repository Tree Root defaults to ~/.idt_root.repo
tr = idt.RepoTree()

# Make a new Sub-repo
# - This creates the directory ~/.idt_root.repo/lvl1.repo
lvl1 = tr.mkrepo('lvl1')

# Save out a string - DataTree will default to pickle if it doesn't have a better type
# - This writes the file ~/.idt_root.repo/lvl1.repo/test_foobar.pkl
# - Metadata stored in ~/.idt_root.repo/lvl1.repo/test_foobar.pkl.mdjson
lvl1.save('foo bar str object', name='test_foobar')

# Flexible ways to access - any script or notebook can now easily access this data
assert lvl1 == tr.lvl1
print(lvl1.test_foobar.load())
print(tr['lvl1'].test_foobar.load())
```