- Notifications
You must be signed in to change notification settings - Fork13
Directory structure standard for experimental pipelines.
License
CINPLA/exdir
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Important: The reference implementation contained in this repository is intended forfeedback and as a basis for future library implementations.It is not ready for production use.
Experimental Directory Structure (exdir) is a proposed, open specification forexperimental pipelines.Exdir is currently a prototype published to invite researchers to give feedback onthe standard.
Exdir is an hierarchical format based on open standards.It is inspired by already existing formats, such as HDF5 and NumPy,and attempts to solve some of the problems assosciated with these whileretaining their benefits.The development of exdir owes a great deal to the efforts of others to standardizedata formats in science in general and neuroscience in particular, among themthe Klusta Kwik Team and Neurodata Without Borders.
Exdir can be installed with Anaconda:
conda install exdir -c cinpla -c conda-forge
The following code creates an Exdir directory with a group and a dataset:
import numpy as npimport exdirexperiment = exdir.File("experiment.exdir")group = experiment.create_group("group")data = np.arange(10)dataset = group.create_dataset("dataset", data=data)
The data can be retrieved using the above used keys:
group = experiment["group"]dataset = group["dataset"]print(dataset)
Attributes can be added to all objects, including files, groups and datasets:
group.attrs["room_number"] = 1234dataset.attrs["recoring_date"] = "2018-02-04"
See thedocumentation for more information.
Seebenchmarks.ipynb.
Alive versioncan be explored using Binder.
Exdir is not a file format in itself, but rather a standardized folder structure.The abstract data model is almost equivalent to that of HDF5,with groups, datasets, and attributes.This was done to simplify the transition from either format.However, data in Exdir is not stored in a single file,but rather multiple files within the hierarchy.The metadata is stored in a restricted verison of the YAML 1.2 formatand the binary data in the NumPy 2.0 format.
Here is an example structure:
example.exdir (File, folder)│ attributes.yaml (-, file)│ exdir.yaml (-, file)│├── dataset1 (Dataset, folder)│ ├── data.npy (-, file)│ ├── attributes.yaml (-, file)│ └── exdir.yaml (-, file)│└── group1 (Group, folder)│ ├── attributes.yaml (-, file) └── exdir.yaml (-, file) │ ├── dataset3 (Dataset, folder) │ ├── data.npy (-, file) │ ├── attributes.yaml (-, file) │ └── exdir.yaml (-, file) │ ├── link1 (Link, folder) │ └── exdir.yaml (-, file) │ └── dataset4 (Dataset, folder) ├── data.npy (-, file) ├── attributes.yaml (-, file) ├── exdir.yaml (-, file) │ └── raw (Raw, folder) ├── image0001.tif (-, file) ├── image0002.tif (-, file) └── ...
The above structure shows the name of the object, the type of the object in exdir andthe type of the object on the file system as follows:
[name] ([EXP type], [file system type])
A dash (-) indicates that the object doesn't have a separate internalrepresentation in the format, but is used indirectly.It is however explicitly stored in the file system.
The above structure shows that theexample.exdir
file is simply a folder inthe file system, but when read by an exdir parser, it appears as aFile
.TheFile
is the root object of any structure.The metadata of theFile
is stored in a file named meta.yaml.This is internal to exdir.Attributes of theFile
is stored in a file named attributes.yaml.This is optional.
Below the file, multiple objects may appear, among themDataset
s andGroup
s.BothDataset
s andGroup
s are stored as folders in the file system.Both have their metadata stored in files named meta.yaml.These are not visible as files within the exdir format, but appear simply asthe metadata for theDataset
s andGroup
s.
If there is any additional data assosciated with the dataset,it may (optionally) be stored in a folder namedraw
.This differs from HDF5, but allows storing raw data from experiments (such asTIFF images from an external microscopy system) locally with the dataconverted to the NumPy format.
By reusing the structure of HDF5, exdir should be familiar to researchers thathave experience with this format.However, by not storing the data in a single file,the data is much less prone to corruption.Further, HDF5 is not optimal for modifications, parallelization or dataexploration.
By storing the data in separate files, we get the many benefits of modern filesystems in protection against data corruption.The data is more easily accessible in parallell computing and is stored ina well known and tested format.It is easier to explore the data by use of standard command line tools or simplythe file explorer.
However, we intend to develop a graphical user interface along the lines ofHDF5view that allows simple data exploration similar to this.
- Exdir should be based on existing open standards
- Exdir should not solve problems that have already been solved, such as storing binary data
- Exdir should be lightweight
Exdir was designed due to a need at the Centre for IntegrativeNeuroplasticity (CINPLA) at the University of Oslo for a format that wouldfit the experimental pipeline.While researching the different options, we found that the neurosciencecommunity had several formats for storing experimental data.A large effort at standardizing the format in the community was spawned byNeurodata Without Borders (NWB).An initial version of the NWB format was published, based on the HDF5 format.However, shortly after the first publication of NWB, concerns were voicedabout HDF5 format from the developers of the klusta project[1].They had been using HDF5 as the underlying file format for their software suiteand started seeing problems with the file format among their users.They saw multiple problems with HDF5 in the form of data corrpution, performanceissues, bugs and poor support for parallelization.
HDF5 is not optimal for modifications.This is not a problem if you only store data from acquisition,as this shouldn't be changed.However, for analysis it is often necessary to modify the data multiple times asdifferent methods and parameters are tested.At the same time, it is beneficial to keep the analysed data stored togetherwith the acquisition data.
About
Directory structure standard for experimental pipelines.