lrq3000/fdictPublic

NotificationsYou must be signed in to change notification settings
Fork3
Star7

Easy out-of-core computing with nested data structures in Python as a drop-in replacement of dict

License

MIT license

7 stars 3 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 85 Commits
.github/workflows		.github/workflows
perf		perf
src/fdict		src/fdict
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
Makefile		Makefile
README.rst		README.rst
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
setup.py		setup.py

Repository files navigation

fdict

Easy out-of-core computing with recursive data structures in Python with a drop-in dict replacement. Just usesfdict() instead ofdict(), you are good to go!

fdict andsfdict can be initialized with a standarddict:

fromfdictimportfdict,sfdictd=fdict({'a': {'b':1,'c': [2,3]},'d':4})

Out: {'a/c': [2, 3], 'd': 4, 'a/b': 1}

Note: if you try the code above, the output may be ordered differently, but the values should be the same.

Nested dicts will be converted on-the-fly:

d['e']= {'f': {'g': {'h':5}}}

Out: {'e/f/g/h': 5, 'a/c': [2, 3], 'd': 4, 'a/b': 1}

And it can be converted back to a standard dict at any time:

d.to_dict_nested()

Out: {'a': {'c': [2, 3], 'b': 1}, 'e': {'f': {'g': {'h': 5}}}, 'd': 4}

To store all items on disk (out-of-core computing), usesfdict, a subclass offdict using the native Python moduleshelve internally:

# Initialize an empty database in file myshelf.dbd=sfdict(filename='myshelf.db')d['a']= {'b':1,'c': [2,3]}d.sync()# synchronize all changes back to diskd.close()# should always close a db# Reopen the same databased2=sfdict(filename='myshelf.db')print(d2)d2.close()

Out: {'a/b': 1, 'a/c': [2, 3]}

The intention of this module is to provide a very easy and pythonic data structure to do out-of-core computing of very nested/recursive big data, while still having reasonably acceptable performances. Currently, no other library can do out-of-core computing of very recursive data, because they all serialize at 1st level nodes. Hence, the goal is to provide a very easy way to prototype out-of-core applications, which you can later replace with a faster datatype.

Hence, this module providesfdict() andsfdict(), which both provide a similar interface todict() with flattened keys for the first and out-of-core storage for the second (using nativeshelve library). There is no third-party dependancy.

Thefdict() class provides the basic system allowing to have an internal flattened representation of a nested dict, then you can subclass it to support your favorite out-of-core library as long as it implements dict-like methods: an exemple is provided withsfdict() usingshelve, but you can subclass to usechest,shove,sqlite,zodb, etc.

Note: if you usesfdict(), do not forget to.sync() and.close() to commit the changes back to the file.

Alternatives, notably based on numpy and so probably faster but with fixed dimensions, can be found in thewendelin.core project,zarr,zict and there is alsodask for pandas dataframes.

Differences with dict

Although maximum compatibility was the primary goal, a different implementation of course brings differences that are unavoidable.

The primary difference is that calling items(), keys(), values() and view* methods will return all children leaves nested at any level, whereas a dict returns only the direct children. Also, by default, these methods return only leaves (non-dict objects) and not nodes, although you can override this by suppling the nodes=True argument.

Another difference is conflicts: you can have an item being both a leaf and a node, because there is no way to check that there is no node without walking all items (ie, usingviewitems(), and this method is the limitation offdict data structure).

This also means that when assigning an item that was already assigned, nodes will NOT get replaced, but singleton will be correctly replaced. To be more explicit:

This works:

d=fdict({'a':1,'b': {'c':2}})d['a']=-1print(d)d['a']= {'d':3,'e':4}print(d)

{'a': -1, 'b/c': 2}

{'a/d': 3, 'a/e': 4, 'b/c': 2}

But the following does NOT work as expected:

d=fdict({'a':1,'b': {'c':2}})d['b']=-1# setting -1 to a node should delete children leaves/nodes, but here we get both a -1 value for the b node and still keep the c child leaf!print(d)

{'a': 1, 'b': -1, 'b/c': 2}

A minor difference is the handling of keys: assigning an empty dict to a key will not create the key (e.g.d['a'] = {} will not create the keya, it will stay inexistent until it gets assigned a non empty dict value), and assigning sub-keys that do not exist is ok without any prior parent dict creation (e.g.d = fdict(); d['a']['b']['c']['d']['e'] = 1 is OK).

Similarly, walkingkeys(),values() anditems() will walk through all nested leaves at any nested level. For exploration convenience, if you want a behavior similar todict to explore only the direct children displaying only the direct children, you can useviewkeys_restrict(),viewitems_restrict(),viewvalues_restrict(),firstkey(),firstitem(),firstvalue(). Note however that the walking will not be faster than walking all items (because internally that is what is being done), so you cannot optimize speed with these methods, it is only for convenience.

Another minor difference is how pop() and popitem() are handled: they will return the next leaf at any nested level, and never nodes. Thus, you cannot get the next item at a specific level, but only the next item at any nested level.

Performances

fdict was made with maximum compatibility with existing code usingdict and with reasonable performances. That's in theory, in practicefdict are slower thandict for most purposes, except setitem and getitem if you use direct access form (eg, x['a/b/c'] instead of x['a']['b']['c']).

As such, you can expect O(1) performance just likedict for any operation on leaves (non-dict objects): getitem, setitem, delitem, eq contains. In practice,fdict is about 10x slower thandict because of class overhead and key string manipulation, for both indirect access (ie,x['a']['b']['c']) and 3x slower for direct access on leaves (ie,x['a/b/c']). Thus direct access form might be preferable if you want a faster set and get. This performance cost is acceptable for a quick prototype of a bigdata database, since building and retrieving items are the most common operations.

The major drawback comes when you work on nodes (nested dict objects): since all keys are flattened and on the same level, the only way to get only the children of a nested dict (aka a branch) is to walk through all keys and filter out the ones not matching the current branch. This means that any operation on nodes will be in O(n) where n is the total number of items in the whole fdict. Affected operations are: items, keys, values, view*, iter*, delitem on nodes, eq on nodes, contains on nodes.

Interestingly, getitem on nodes is not affected, because we use a lazy approach: getting a nested dict will not build anything, it will just spawn a new fdict with a different filtering rootpath. Nothing gets evaluated, until you either attain a leaf (in this case we return the non-dict object value) or you use an operation on node such as items(). Keep in mind that any nested fdict will share the same internal flattened dict, so any nested fdict will also have access to all items at any level!

This was done by design:fdict is made to be as fast asdict to build and to retrieve leaves, in exchange for slower exploration. In other words, you can expect blazingly fast creation offdict as well as getting any leaf object at any nested level, but you should be careful when exploring. However, even if your dict is bigger than RAM, you can use the view* methods (viewitems, viewkeys, viewvalues) to walk all the items as a generator.

To circumvent this pitfall, two things were implemented:

extract() method can be used on a nested fdict to filter all keys once and build a new fdict containing only the pertinent nested items. Usage isextracted_fdict = fdict({'a': {'b': 1, 'c': [2, 3]}})['a'].extract().
fastview=True argument can be used when creating a fdict to enable the FastView mode. This mode will imply a small memory/space overhead to store nodes and also will increase complexity of setitem on nodes to O(m*l) where m is the number of parents of the current leaf added, and l the number of leaves added (usually one but if you set a dict it will be converted to multiple leaves). On the other hand, it will make items, keys, values, view* and other nodes operations methods as fast as with adict by using lookup tables to access direct children directly, which was O(n) where n was the whole list of items at any level in the fdict. It is possible to convert a non-fastview fdict to a fastview fdict, just by supplying it as the initialization dict.
nodel=True argument activates a special mode where delitem is nullified, but key lookup (eg,in contains test) time is O(1) for nodes. With standardfdict,in contains test is O(1) only for leaves and O(n) for nodes because it callsviewkeys(). With this mode, empty nodes metadata are created and so lookup for nodes existence is very fast, but at the expense that deletion is not possible because it would make the database incoherent (i.e. nodes without leaf). However, setitem to replace a leaf will still work. This mode is particularly useful for fast database building, and then you can initialize a standard fdict with your finalized nodel fdict, which will then allow you to delitem.

Thus, if you want to do data exploration on afdict, you can use either of these two approaches to speed up your exploration to a reasonable time, with performances close to adict. In practice,extract is better if you have lots of items per nesting level, whereasfastview might be better if you have a very nested structure with few items per level but lots of levels.

There is probably room for speed optimization, if you have any idea please feel free to open an issue on Github.

Note that this module is compatible withPyPy, so you might get a speed-up with this interpreter.

In any case, this module is primarily meant to do quick prototypes of bigdata databases, that you can then switch to another faster database after reworking the structure a bit.

A good example is the retrieval of online data: in this case, you care less about the data structure performance since it is negligible compared to network bandwidth and I/O. Then, when you have the data, you can rework it to convert to another type of database with a flat schema (by extracting only the fields you are interested in).

Also you can convert afdict orsfdict to a flatteneddict using theto_dict() method (ie, all items are leaves, keys are the full paths in the tree), or to a nested (natural)dict usingto_dict_nested(), you will then get a standarddict stored in RAM that you can access at full speed, or use as an input to initialize another type of out-of-core database.

Documentation

fdict class

classfdict(dict):'''    Flattened nested dict, all items are settable and gettable through ['item1']['item2'] standard form or ['item1/item2'] internal form.    This allows to replace the internal dict with any on-disk storage system like a shelve's shelf (great for huge nested dicts that cannot fit into memory).    Main limitation: an entry can be both a singleton and a nested fdict: when an item is a singleton, you can setitem to replace to a nested dict, but if it is a nested dict and you setitem it to a singleton, both will coexist. Except for fastview mode, there is no way to know if a nested dict exists unless you walk through all items, which would be too consuming for a simple setitem. In this case, a getitem will always return the singleton, but nested leaves can always be accessed via items() or by direct access (eg, x['a/b/c']).    Fastview mode: remove conflicts issue and allow for fast O(m) contains(), delete() and view*() (such as vieitems()) where m in the number of subitems, instead of O(n) where n was the total number of elements in the fdict(). Downside is setitem() being O(m) too because of nodes metadata building, and memory/storage overhead, since we store all nodes and leaves lists in order to allow for fast lookup.    '''def__init__(self,d=None,rootpath='',delimiter='/',fastview=False,nodel=False,**kwargs):

Parameters:

d:dict, optional
Initialize with a pre-existing dict.Also used internally to pass a reference to parent fdict.
rootpath:str, optional
Internal variable, define the nested level.
delimiter:str, optional
Internal delimiter for nested levels. Can also be used forgetitem direct access (e.g.x['a/b/c']).[default : '/']
fastview:bool, optional
Activates fastview mode, which makes setitem slowerin O(m*l) instead of O(1), but makes view* methods(viewitem, viewkeys, viewvalues) as fast as dict's.[default : False]
nodel:bool, optional
Activates nodel mode, which makes contains testin O(1) for nodes (leaf test is always O(1) in any mode).Only drawback: delitem is not suppressed.Useful for quick building of databases, then you canreopen the database with a normal fdict if you wantthe ability to delitem.[default : False]

Returns:

out : dict-like object.

sfdict class

classsfdict(fdict):'''    A nested dict with flattened internal representation, combined with shelve to allow for efficient storage and memory allocation of huge nested dictionnaries.    If you change leaf items (eg, list.append), do not forget to sync() to commit changes to disk and empty memory cache because else this class has no way to know if leaf items were changed!    '''def__init__(self,*args,**kwargs):

Parameters:

d:dict, optional
Initialize with a pre-existing dict.Also used internally to pass a reference to parent fdict.
rootpath:str, optional
Internal variable, define the nested level.
delimiter:str, optional
Internal delimiter for nested levels. Can also be used forgetitem direct access (e.g.x['a/b/c']).[default : '/']
fastview:bool, optional
Activates fastview mode, which makes setitem slowerin O(m*l) instead of O(1), but makes view* methods(viewitem, viewkeys, viewvalues) as fast as dict's.[default : False]
nodel:bool, optional
Activates nodel mode, which makes contains testin O(1) for nodes (leaf test is always O(1) in any mode).Only drawback: delitem is not suppressed.Useful for quick building of databases, then you canreopen the database with a normal fdict if you wantthe ability to delitem.[default : False]
filename:str, optional
Path and filename where to store the database.[default : random temporary file]
autosync:bool, optional
Commit (sync) to file at every setitem (assignment).Assignments are always stored on-disk asap, but notchanges to non-dict collections stored in leaves(e.g. updating a list stored in a leaf will not commit to disk).This option allows to sync at the next assignment automatically(because there is no way to know if a leaf collection changed).Drawback: if you do a lot of assignments, this will significantlyslow down your processing, so it is advised to rather sync()manually at regular intervals.[default : False]
writeback:bool, optional
Activates shelve writeback option. If False, only assignmentswill allow committing changes of leaf collections. See shelvedocumentation.[default : True]
forcedumbdbm:bool, optional
Force the use of the Dumb DBM implementation to managethe on-disk database (should not be used unless you get anexception because not any other implementation of anydbmcan be found on your system). Dumb DBM should work onany platform, it is native to Python.[default : False]
readonly:bool, optional
Open the database as read-only.[default : False]

Returns:

out : dict-like object.

Notes:

On Linux and MacOS, make sure to only open once each database file, there cannot be two handles pointing to the same file, this will cause a "gdm.error: resource temporarily unavailable" exception!

LICENCE

This library is licensed under the MIT License and was created by Stephen Karl Larroque.

It was initially made for the Coma Science Group - GIGA Consciousness - CHU de Liege, Belgium, as part of the paneuropean CENTER-TBI project.

Included are theflatkeys function bybfontaine and_count_iter_items byzuo.

About

Easy out-of-core computing with nested data structures in Python as a drop-in replacement of dict

Releases7

fdict v0.8.1 Latest

Aug 8, 2017

+ 6 releases

Packages

No packages published

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

License

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

fdict

Differences with dict

Performances

Documentation

fdict class

sfdict class

LICENCE

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases7

Packages

Uh oh!

Contributors2

Uh oh!

Languages

Movatterモバイル変換

License

lrq3000/fdict

Folders and files

Latest commit

History

Repository files navigation

fdict

Differences with dict

Performances

Documentation

fdict class

sfdict class

LICENCE

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases7

Packages0

Uh oh!

Contributors2

Uh oh!

Languages

Packages