Description
This issue is to sketch out some ideas and start a discussion related to retrieving and saving datasets. This follows from some prior discussion during working group meetings.
Local caching of records:
When accessing records from the archive, it would be very helpful to be able to store this data locally in a cache. It seems like this could come in two distinct flavors:
Automatic caching.
Looking at the current source code, there appears to be some framework already in place (though perhaps not yet fully implemented?) that relies upon DBM for automatic caching. If implemented, this would allow QCPortal to check the local cache to see whether a given record from a specified server has already been retrieved and, if so, use the local copy. This would certainly be very beneficial, since for many users rerunning a Python script or restarting a notebook kernel would not require re-downloading data. The actual performance, however, will depend on the amount of memory allocated to the cache and the size of the dataset a user is working with.
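To illustrate the idea (not the actual QCPortal internals), here is a minimal sketch of what a DBM-backed automatic cache lookup could look like; the cache path, key scheme, `get_records` call, `address` attribute, and pickling of records are all assumptions for the sake of the example:

```python
# Hypothetical sketch: check a local DBM file keyed by (server address, record id)
# before hitting the network. Serialization details are glossed over here.
import dbm
import pickle

CACHE_PATH = "qcportal_cache"  # hypothetical default cache location

def get_record_cached(client, record_id):
    key = f"{client.address}:{record_id}".encode()
    with dbm.open(CACHE_PATH, "c") as cache:
        if key in cache:
            return pickle.loads(cache[key])      # cache hit: no network call
        record = client.get_records(record_id)   # cache miss: fetch from the server
        cache[key] = pickle.dumps(record)        # store for future sessions
        return record
```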
User-defined caching.
This would provide the same basic functionality as the automatic caching, but would allow a user to define the location where the database is stored; by default, this cache would not have a maximum size limit. This would be beneficial to users working with, say, entire datasets. For example, when working with the QM9 dataset, I would like to download the records only once and store them locally for easy access later; I don't want to worry about the dataset records being purged (due to downloading other data from QCArchive) or the dataset simply being larger than the default memory allocation. In my own work, I've implemented a simple wrapper around the calls to QCPortal where each record is saved into an SQLdict database (roughly as in the sketch below), and this has been very helpful, especially in cases where I lose the connection to the database.
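A rough sketch of that kind of user-defined cache, assuming "SQLdict" refers to the sqlitedict package and that the client exposes a `get_records` method returning records in the requested order; the path and record IDs are illustrative only:

```python
# Hypothetical wrapper: records are stored in a user-chosen SQLite file with no
# size limit, so only records missing from the local cache are fetched remotely.
import os
from sqlitedict import SqliteDict

def get_records_cached(client, record_ids, cache_path="~/qm9_records.sqlite"):
    cache_path = os.path.expanduser(cache_path)
    fetched = {}
    with SqliteDict(cache_path, autocommit=True) as cache:
        missing = [rid for rid in record_ids if str(rid) not in cache]
        # Only the records not already on disk are requested from the server.
        for rid, rec in zip(missing, client.get_records(missing)):
            cache[str(rid)] = rec
        for rid in record_ids:
            fetched[rid] = cache[str(rid)]
    return fetched
```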
Ability to download entire datasets:
Some of the datasets in the older version included HDF5 files (which could be downloaded either via the portal or from Zenodo). This allowed an entire dataset to be downloaded very efficiently. As an example, it took about 5 minutes to download QM9 in HDF5 format (~160 MB when gzipped) for ~133K records; fetching these records one at a time (using the new code) took more than 12 hours. Having a way to download an entire dataset in one file would be very helpful.
- A few related notes: even downloading this required wrapping the calls to QCPortal in try/except statements to automatically reconnect, as I would periodically lose the connection. I was able to speed this up to about 15 minutes by using concurrent.futures to multithread the per-record fetches (see the sketch below). If a single downloadable file is not possible for a dataset (e.g., given that datasets may be changing), it would be good to have some way to efficiently download all the records at once and save them to a local cached database. Either way, calling "get_record" in a loop is not very efficient at this point and could be a big stumbling block for users.
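For reference, the workaround looked roughly like the following; the retry count, delay, worker count, and the `get_records` call are assumptions rather than QCPortal recommendations:

```python
# Sketch of the per-record workaround: each fetch is wrapped in a retry loop to
# survive dropped connections, and the fetches are fanned out across threads.
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_with_retry(client, record_id, retries=5, delay=10.0):
    for attempt in range(retries):
        try:
            return client.get_records(record_id)
        except Exception:
            time.sleep(delay)  # connection dropped; wait, then retry
    raise RuntimeError(f"Could not fetch record {record_id} after {retries} tries")

def fetch_all(client, record_ids, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda rid: fetch_with_retry(client, rid), record_ids))
```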