

DataTree implements a simple file-system-based storage of Python objects in a way that facilitates quick and simple programmatic access (i.e. interactive). Functionally, the DataTree builds a dot representation (attribute access on objects) based on files and directories. Directories are 'Repos' (repositories) and files are mapped onto a DataTree StorageInterface based on the file's extension (e.g. pickle -> '.pkl').
The package comes with built-in support for general objects (pickle) and Pandas objects stored in HDF. The library is Jupyter notebook-aware and will use HTML representations where sensible.
- Interactive - Repositories and their objects are represented as properties nested in objects
  - Enables tab completion, monkey-patched doc-strings with metadata, and rich HTML representations
- Maintainable - no special internal data or database systems, no lock-in.
- DataTree is currently self-contained in a single module file - drop it in and start using
- Flexible - Primary data storage logic is generalized in the StorageInterface class.
  - Adding features or an interface for your project only requires extending this class (see the sketch after this list)
- Metadata - Every object stores basic metadata (time of operation, object type, etc.) which is made searchable. More targeted StorageInterfaces can extract more significant metadata related to the type.
  - For instance, the HDF interface for Pandas objects will store the columns and a sample of the index
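
To give a feel for what extending the StorageInterface involves, here is a minimal sketch of a custom plain-text storage type. The `extension` attribute and the `save`/`load` method signatures are illustrative assumptions rather than the library's documented API - consult the `StorageInterface` class in the module for the actual hooks to override.

```python
import interactive_data_tree as idt

# A minimal sketch of a custom storage type. The attribute and method
# names below (extension, save, load) are illustrative assumptions;
# check the StorageInterface class in the module for the actual hooks.
class TextStorageInterface(idt.StorageInterface):
    extension = 'txt'  # assumed: maps '.txt' files to this interface

    def save(self, obj, path):
        # Persist the object as plain UTF-8 text
        with open(path, 'w') as f:
            f.write(str(obj))

    def load(self, path):
        # Read the stored text back in
        with open(path, 'r') as f:
            return f.read()
```

Once such an interface is registered with the tree (the registration mechanism is also interface-specific), objects saved through it would appear as ordinary `.txt` files with metadata stored alongside them.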
Iterative tasks in data analysis often require more than one dataset. Furthermore, the data may come from a variety of locations - the web, a database, unstructured data, etc. - and may not be well represented by a traditional table. Thus, practitioners are left trying to manage the source data (and their source systems) as well as any intermediate and/or output datasets. It's a difficult and time-consuming task.
The ultimate focus of this project is to make the management of many varied datasets simple, maintainable, and portable. The only expectation for use with DataTree is that the data can be represented in a way that can be stored on the local filesystem. For standard datasets, this likely means storing the data itself in the DataTree. However, new interfaces can be implemented that simply store the information required to access a remote system (e.g. store a JSON file with connection information and a SQL query - the storage interface can then lazily retrieve the data on load, as sketched below).
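
As a concrete illustration of that lazy remote-access pattern, the sketch below persists only a connection string and a query, deferring the actual fetch to load time. It reuses the hypothetical subclassing API from the sketch above; `sqlalchemy` plus `pandas.read_sql` is one possible retrieval mechanism, and none of this is a built-in DataTree interface.

```python
import json

import pandas as pd
import sqlalchemy
import interactive_data_tree as idt

# Hypothetical sketch of a lazy remote interface: only the connection
# info and query are stored on disk; the data is fetched at load time.
class SQLQueryStorageInterface(idt.StorageInterface):
    extension = 'sqljson'  # assumed extension-to-interface mapping

    def save(self, obj, path):
        # obj is expected to be a dict like
        # {'conn': 'postgresql://...', 'query': 'SELECT ...'}
        with open(path, 'w') as f:
            json.dump(obj, f)

    def load(self, path):
        with open(path, 'r') as f:
            spec = json.load(f)
        # Retrieve the data lazily, only when the object is loaded
        engine = sqlalchemy.create_engine(spec['conn'])
        return pd.read_sql(spec['query'], engine)
```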
DataTree manages the creation of directories and subsequent object files. Each directory is referred to as a 'repo', and the generic object files in these repos are mapped to DataTree StorageTypes. In the common case, objects are pickle-serialized Python objects, but storage types have few requirements and are easily extended. An object name may have several different types stored, but namespaces across types won't collide.
Every object has metadata stored alongside it in a JSON file. Each storage type can choose to use the default metadata storage, amend the default storage by overriding the write procedure, or implement an entirely different metadata storage altogether.
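
Because the default metadata is plain JSON sitting next to the object file, it can be inspected with nothing but the standard library. The path below follows the file-naming convention shown in the quickstart that follows; the exact fields stored vary by storage type.

```python
import json
import os

# The default metadata is a plain JSON file next to the object file;
# this path follows the naming convention shown in the quickstart below
md_path = os.path.expanduser('~/.idt_root.repo/lvl1.repo/test_foobar.pkl.mdjson')
with open(md_path) as f:
    metadata = json.load(f)

# Print the whole structure rather than assume exact field names -
# the fields stored (operation time, object type, etc.) vary by storage type
print(metadata)
```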
```python
import interactive_data_tree as idt

# Repository Tree Root defaults to ~/.idt_root.repo
tr = idt.RepoTree()

# Make a new Sub-repo
# - This creates the directory ~/.idt_root.repo/lvl1.repo
lvl1 = tr.mkrepo('lvl1')

# Save out a string - DataTree will default to pickle if it doesn't have a better type
# - This writes the file ~/.idt_root.repo/lvl1.repo/test_foobar.pkl
# - Metadata stored in ~/.idt_root.repo/lvl1.repo/test_foobar.pkl.mdjson
lvl1.save('foo bar str object', name='test_foobar')

# Flexible ways to access - any script or notebook can now easily access this data
assert lvl1 == tr.lvl1
print(lvl1.test_foobar.load())
print(tr['lvl1'].test_foobar.load())
```