# pudl-zenodo-storage

Tools for creating versioned archives of raw data on Zenodo using Frictionless data packages.
This repo has been replaced by the new pudl-archiver repo, which combines both the scraping and archiving processes.
Zenodo is an open repository maintained by CERN that allows users to archive research-related digital artifacts for free. Catalyst uses Zenodo to archive raw datasets scraped from the likes of FERC, EIA, and the EPA to ensure reliable, versioned access to the data PUDL depends on. Take a look at our archives here. In the event that any of the publishers change the format or contents of their data, remove old years, or simply cease to exist, we will have a permanent record of the data. All data uploaded to Zenodo is assigned a DOI for streamlined access and citing.
Whenever the historical data changes substantially or new years are added, we make new Zenodo archives and build out new versions of PUDL that are compatible. Pairing specific Zenodo archives with PUDL releases ensures a functioning ETL for users and developers.
Once created, Zenodo archives cannot be deleted. This is, in fact, their purpose! It also means that one ought to be sparing with the information uploaded. We don't want to wade through tons of test uploads when looking for the most recent version of the data. Luckily, Zenodo has created a sandbox environment for testing API integration. Unlike the regular environment, the sandbox can be wiped clean at any time. When testing uploads, you'll want to upload to the sandbox first. Because we want to keep our Zenodo as clean as possible, we keep the upload tokens internal to Catalyst. If there's data you want to see integrated, and you're not part of the team, send us an email at hello@catalyst.coop.
One last thing: Zenodo archives for particular datasets are referred to as "depositions". Each dataset is its own deposition, which gets created when the dataset is first uploaded to Zenodo and versioned as the source releases new data that gets uploaded to Zenodo.
We recommend using mamba to create and manage your environment.
In your terminal, run:
```
$ mamba env create -f environment.yml
$ mamba activate pudl-zenodo-storage
```
When you're adding an entirely new dataset to PUDL, your first course of action is building a scrapy script in the [pudl-scrapers](https://github.com/catalyst-cooperative/pudl-scrapers) repo. Once you've done that, you're ready to archive.
First, you'll need to fill in some metadata in the `pudl` repo. Start by adding a new key-value pair in the `SOURCE` dict in the `pudl/metadata/source.py` module. It's best to keep the key (the source name) you choose simple and consistent across all repos that reference the data. Once you've done this, you'll need to install your local version of `pudl` (rather than the default version from GitHub). Doing this will allow the Zenodo archiver script to process the changes you made to the `pudl` repo.
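As a rough sketch, a new entry in that dict might look like the following. The key name and field values here are hypothetical, and the exact schema is defined in the `pudl` repo, so use the existing entries there as your model:

```python
# Hypothetical sketch of a new entry in the SOURCE dict in
# pudl/metadata/source.py; field names and values are illustrative only.
SOURCE = {
    "newdata": {
        "title": "New Example Dataset",
        "path": "https://www.example.gov/newdata",
        "description": "Raw data scraped from a hypothetical agency.",
    },
}
```

Whatever key you choose here ("newdata" in this sketch) should be reused verbatim as the module and deposition name in the other repos.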
While in the `pudl-zenodo-storage` environment, navigate to the `pudl` repo and run:

```
$ pip install -e ./
```
You don't need to worry about the `fields.py` module until you're ready to transform the data in PUDL.
Now, come back to this repo and create a module for the dataset in the `frictionless` directory. Give it the same name as the key you made for the data in the `SOURCE` dict. Use the existing modules as a model for your new one. The main function is called `datapackager()`, and it serves to produce a JSON descriptor for the Zenodo archival collection.
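The shape of such a function can be sketched roughly as follows. This is illustrative only: the actual `datapackager()` implementations, their signatures, and the full descriptor schema live in the `frictionless/` modules of this repo, and the names below are placeholders:

```python
import json


def datapackager(dfiles):
    """Illustrative sketch: build a frictionless-style datapackage
    descriptor for a set of archived files. See the real datapackager()
    functions in the frictionless/ directory for the actual schema."""
    resources = [
        {"name": name, "path": path, "format": path.rsplit(".", 1)[-1]}
        for name, path in sorted(dfiles.items())
    ]
    return {"name": "newdata", "resources": resources}


descriptor = datapackager({"newdata-2020": "newdata-2020.zip"})
print(json.dumps(descriptor, indent=2))
```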
Lastly, you need to:

- Add archive metadata for the new dataset in the `zs/metadata.py` module. This includes creating a UUID (universally unique identifier) for the data. UUIDs are used to uniquely distinguish the archive prior to the creation of a DOI. You can do this using the `uuid.uuid4()` function that is part of the Python standard library.
- Add the chosen deposition name to the list of acceptable names output with the `zenodo_store --help` flag. See `parse_main()` in `zs/cli.py`.
- Add specifications for your new deposition in the `archive_selection()` function, also in `zs/cli.py`.
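Generating the UUID mentioned above is a one-liner with the standard library:

```python
import uuid

# Generate a random version-4 UUID to uniquely identify the archive
# before Zenodo assigns it a DOI. Paste the printed value into the
# new dataset's entry in zs/metadata.py.
new_uuid = uuid.uuid4()
print(new_uuid)
```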
If updating an existing data source -- say, one that has released a new year's worth of data -- you don't need to add any new metadata to the `pudl` repo. Simply run the scraper for the data and then run the Zenodo script as described below. The code was built to detect any changes in the data and automatically create a new version of the same deposition when uploaded.
Before you can archive data, you'll need to run the scrapy script you just created in the `pudl-scrapers` repo. Once you've scraped the data, then you can come back and run the archiver. This script, `zenodo_store`, gets defined as an entry point in `setup.py`.
Next, you'll need to define `ZENODO_SANDBOX_TOKEN_UPLOAD` and `ZENODO_TOKEN_UPLOAD` environment variables on your local machine. As mentioned above, we keep these values internal to Catalyst so as to maintain a clean and reliable archive.
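Setting the variables in your shell looks like this; the values shown are placeholders, since the real tokens are internal to Catalyst:

```shell
# Placeholder values -- request the real tokens from the Catalyst team.
export ZENODO_SANDBOX_TOKEN_UPLOAD="your-sandbox-token"
export ZENODO_TOKEN_UPLOAD="your-production-token"
```

You may want to put these in your shell profile so they persist across sessions.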
The `zenodo_store` script requires you to include the name of the Zenodo deposition as an argument. This is a string value that indicates which dataset you're going to upload. Use the `--help` flag to see a list of supported strings. You can also find a list of the deposition names in the `archive_selection()` function in the `cli.py` module.
When you're testing an archive, you'll want to make sure you use the Zenodo sandbox rather than the official Zenodo archive (see above for more info about the sandbox). Adding the `--verbose` flag will print out logging messages that are helpful for debugging. Adding the `--noop` flag will show you whether the data you scraped is any different from the data you already have uploaded to Zenodo, without uploading anything (so long as there is an existing upload to compare it to).
If the dataset is brand new, you'll also need to add the `--initialize` flag so that the script knows to create a new deposition for the data.
Make sure a new deposition knows where to grab scraped data:
```
$ zenodo_store newdata --noop --verbose
Archive would contain: path/to/scraped/data
```
Compare a newly scraped deposition to the currently archived deposition of the same dataset. If you get the output depicted below, then the archived data is the same as the scraped data, and you don't need to make a new version!
```
$ zenodo_store newdata --noop
{"create": {}, "delete": {}, "update": {}}
```
Test run a new deposition in the sandbox (the output link is fake!):
```
$ zenodo_store newdata --sandbox --verbose --initialize
Uploaded path/to/scraped/data
Your new deposition archive is ready for review at https://sandbox.zenodo.org/deposit/number
```
Once you're confident with your upload, you can go ahead and run the script without any flags:

```
$ zenodo_store newdata
```
The `zs.ZenodoStorage` class provides an interface to create archives and upload files to Zenodo.

Package metadata is kept in dict formats, as necessary to support the frictionless datapackage specification.
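A minimal example of that dict format, following the Frictionless Data Package specification, might look like the following. All names and values here are placeholders rather than real archive metadata:

```python
# Illustrative dict-format package metadata following the Frictionless
# Data Package spec; values are placeholders, not real archive metadata.
package_metadata = {
    "name": "newdata",
    "title": "New Example Dataset",
    "profile": "data-package",
    "licenses": [{"name": "CC-BY-4.0"}],
    "resources": [
        {"name": "newdata-2020", "path": "newdata-2020.zip", "format": "zip"},
    ],
}
```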