DataTree - easy access and storage of data objects


Morgan243/InteractiveDataTree


What is it

DataTree implements a simple file-system-based storage of Python objects in a way that facilitates quick and simple programmatic access (i.e. interactive). Functionally, the DataTree builds a dot representation (attribute access on objects) based on files and directories. Directories are 'Repos' (repositories) and files are mapped onto a DataTree StorageInterface based on the file's extension (e.g. pickle -> '.pkl').
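To make the idea of a dot representation concrete, here is a minimal standalone sketch using only the standard library. It is not DataTree's implementation: the `DirNode` class is hypothetical, and it only borrows the `.repo` and `.pkl` naming conventions shown in the example later in this README.

```python
import pickle
import tempfile
from pathlib import Path


class DirNode:
    """Hypothetical illustration: expose a directory's contents as attributes."""

    def __init__(self, path):
        self._path = Path(path)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith('_'):
            raise AttributeError(name)
        subdir = self._path / f"{name}.repo"
        if subdir.is_dir():
            # Directories become nested nodes.
            return DirNode(subdir)
        obj_file = self._path / f"{name}.pkl"
        if obj_file.is_file():
            # Pickle files are loaded and returned directly in this sketch.
            with obj_file.open('rb') as f:
                return pickle.load(f)
        raise AttributeError(name)

    def __dir__(self):
        # Advertising on-disk names here is what enables tab completion.
        names = [p.stem for p in self._path.iterdir()
                 if p.suffix in ('.repo', '.pkl')]
        return sorted(set(super().__dir__()) | set(names))


# Build a tiny tree on disk, then walk it with attribute access.
root = Path(tempfile.mkdtemp())
(root / "lvl1.repo").mkdir()
with (root / "lvl1.repo" / "test_foobar.pkl").open('wb') as f:
    pickle.dump('foo bar str object', f)

tree = DirNode(root)
print(tree.lvl1.test_foobar)
```

Unlike this sketch, the library returns an object with a `load()` method rather than the loaded value, as the example at the end of this README shows.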

The package comes with built-in support for general objects (pickle) and Pandas objects stored in HDF. The library is Jupyter notebook-aware and will use HTML representations where sensible.

Features
  • Interactive - Repositories and their objects are represented as properties nested in objects
    • Enables tab completion, monkey-patched doc-strings with metadata, and rich HTML representations
  • Maintainable - no special internal data or database systems, no lock-in.
    • DataTree is currently self-contained in a single module file - drop it in and start using
  • Flexible - Primary data storage logic is generalized in the StorageInterface class.
    • Adding features or an interface for your project only requires extending this class
  • Metadata - Every object stores basic metadata (time of operation, object type, etc.) which is made searchable. More targeted StorageInterfaces can extract more significant metadata related to the type.
    • For instance, the HDF interface for Pandas objects will store the columns and a sample of the index

What's it solving

Iterative tasks in data analysis often require more than one dataset. Furthermore, the data may come from a variety of locations (the web, a database, unstructured data, etc.) and may not be well represented by a traditional table. Thus, practitioners are left trying to manage the source data (and their source systems) as well as any intermediate and/or output datasets. It's a difficult and time-consuming task.

The ultimate focus of this project is to make the management of many varied datasets simple, maintainable, and portable. The only expectation for use with DataTree is that the data can be represented in a way that can be stored on the local filesystem. For standard datasets, this likely means storing the data itself in the DataTree. However, new interfaces can be implemented that simply store the information required to access a remote system (e.g. store a JSON file with connection information and a SQL query - the storage interface can then lazily retrieve data on load).
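The lazy-retrieval idea can be sketched as follows. This is a standalone illustration and does not use or extend DataTree's StorageInterface, whose methods are not documented here. The `LazySQLObject` class, the `.sqljson` extension, and the descriptor keys are all assumptions, and SQLite stands in for the remote system.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


class LazySQLObject:
    """Hypothetical: persist only a JSON descriptor; run the query on load()."""

    def __init__(self, descriptor_path):
        self.descriptor_path = Path(descriptor_path)

    @classmethod
    def save(cls, descriptor_path, database, query):
        # Only connection information and the query are written to disk.
        spec = {'database': database, 'query': query}
        Path(descriptor_path).write_text(json.dumps(spec))
        return cls(descriptor_path)

    def load(self):
        # The data is fetched from the source system only when requested.
        spec = json.loads(self.descriptor_path.read_text())
        conn = sqlite3.connect(spec['database'])
        try:
            return conn.execute(spec['query']).fetchall()
        finally:
            conn.close()


# Stand-in "remote" system: a throwaway SQLite database.
workdir = Path(tempfile.mkdtemp())
db_path = str(workdir / 'source.db')
conn = sqlite3.connect(db_path)
conn.execute('CREATE TABLE sales (region TEXT, amount INTEGER)')
conn.executemany('INSERT INTO sales VALUES (?, ?)', [('east', 10), ('west', 20)])
conn.commit()
conn.close()

sales = LazySQLObject.save(
    workdir / 'sales.sqljson', db_path,
    'SELECT region, amount FROM sales ORDER BY region')
rows = sales.load()
print(rows)
```

The file on disk stays small and portable, while the rows always reflect the current state of the source system at load time.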

How does it work

DataTree manages the creation of directories and subsequent object files. Each directory is referred to as a 'repo', and the generic object files in these repos are mapped to DataTree StorageTypes. In the common case, objects are pickle-serialized Python objects, but storage types have few requirements and are easily extended. An object name may have several different types stored, but namespaces across types won't collide.
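One way to picture how a single object name can hold several types without collisions is to index files by name first and by type second. The sketch below is an assumption about the general approach, not the library's code: `EXT_TO_TYPE`, `index_repo`, and the `.hdf` extension are hypothetical (only `.pkl` is confirmed by this README).

```python
import tempfile
from collections import defaultdict
from pathlib import Path

# Hypothetical extension-to-type table; '.hdf' is an assumed extension.
EXT_TO_TYPE = {'.pkl': 'pickle', '.hdf': 'hdf'}


def index_repo(repo_dir):
    """Group the files in a repo directory as {object_name: {type_name: path}}."""
    index = defaultdict(dict)
    for path in Path(repo_dir).iterdir():
        type_name = EXT_TO_TYPE.get(path.suffix)
        if path.is_file() and type_name is not None:
            # Same stem, different extension: stored under separate type keys.
            index[path.stem][type_name] = path
    return dict(index)


repo = Path(tempfile.mkdtemp())
for filename in ('results.pkl', 'results.hdf', 'notes.pkl'):
    (repo / filename).touch()

repo_index = index_repo(repo)
print(sorted(repo_index))
print(sorted(repo_index['results']))
```

Because the extension is part of the key, `results.pkl` and `results.hdf` coexist under the single name `results`.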

Every object has metadata stored alongside it in a JSON file. Each storage type can choose to use the default metadata storage, amend the default storage by overriding the write procedure, or implement an entirely different metadata storage altogether.
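The sidecar-metadata pattern can be sketched with the standard library alone. The `.mdjson` suffix comes from the example in this README, but everything else is an assumption: the helper names `save_with_metadata` and `search_by_type` and the metadata field names are hypothetical and do not reflect the library's actual schema.

```python
import json
import pickle
import tempfile
import time
from pathlib import Path


def save_with_metadata(obj, path):
    """Pickle obj to path and write a JSON sidecar next to it."""
    path = Path(path)
    with path.open('wb') as f:
        pickle.dump(obj, f)
    metadata = {
        'write_time': time.time(),       # assumed field name
        'obj_type': type(obj).__name__,  # assumed field name
    }
    # e.g. test_foobar.pkl -> test_foobar.pkl.mdjson
    md_path = path.with_name(path.name + '.mdjson')
    md_path.write_text(json.dumps(metadata))
    return md_path


def search_by_type(directory, type_name):
    """Return object file names whose sidecar records the given type."""
    hits = []
    for md_path in sorted(Path(directory).glob('*.mdjson')):
        if json.loads(md_path.read_text()).get('obj_type') == type_name:
            hits.append(md_path.name[:-len('.mdjson')])
    return hits


store = Path(tempfile.mkdtemp())
save_with_metadata('foo bar str object', store / 'test_foobar.pkl')
save_with_metadata([1, 2, 3], store / 'numbers.pkl')

str_objects = search_by_type(store, 'str')
print(str_objects)
```

Keeping metadata in plain JSON means it can be searched without unpickling the objects themselves.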

Example

    import interactive_data_tree as idt

    # Repository Tree Root defaults to ~/.idt_root.repo
    tr = idt.RepoTree()

    # Make a new Sub-repo
    # - This creates the directory ~/.idt_root.repo/lvl1.repo
    lvl1 = tr.mkrepo('lvl1')

    # Save out a string - DataTree will default to pickle if it doesn't have a better type
    # - This writes the file ~/.idt_root.repo/lvl1.repo/test_foobar.pkl
    # - Metadata stored in ~/.idt_root.repo/lvl1.repo/test_foobar.pkl.mdjson
    lvl1.save('foo bar str object', name='test_foobar')

    # Flexible ways to access - any script or notebook can now easily access this data
    assert lvl1 == tr.lvl1
    print(lvl1.test_foobar.load())
    print(tr['lvl1'].test_foobar.load())
