catseye/yastasoti

MIRROR of https://codeberg.org/catseye/yastasoti : Yet another script to archive stuff off teh internets

License

Unlicense license

`yastasoti`

Version 0.4 | Entry @ catseye.tc | See also: ellsync ∘ tagfarm ∘ shelf

Yet another script to archive stuff off teh internets.

It's not a spider that automatically crawls previously undiscovered webpages; it's intended to be run by a human to make backups of resources they have already seen and recorded the URLs of.

It was split off from Feedmark, which doesn't itself need to support this function.

Features

input is a JSON list of objects containing links (such as those produced by Feedmark)
output is a JSON list of objects that could not be retrieved, which can be fed back into the script as input
checks links with HEAD requests by default. --archive-to causes each link to be fetched with GET and saved to the specified directory. --archive-via specifies an archive router which causes each link to be fetched, and saved to a directory which is selected based on the URL of the link.
tries to be idempotent and not create a new local file if the remote file hasn't changed
handles links that are local files; checks if the file exists locally
can log its actions verbosely to a specified logfile
source code is a single, public-domain file with a single dependency (requests)
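
The idempotence goal in the list above can be illustrated with a small sketch. This is not taken from yastasoti's source: the helper name `save_if_changed` and the digest-comparison strategy are assumptions made for illustration, and the code targets Python 3.

```python
import hashlib
import os


def save_if_changed(path, content):
    """Write `content` (bytes) to `path` only if it differs from what is there.

    Returns True if the file was written, False if it was left untouched.
    Hypothetical helper; yastasoti's own strategy may differ.
    """
    if os.path.exists(path):
        with open(path, "rb") as f:
            existing = f.read()
        # Compare digests so an unchanged remote file causes no rewrite.
        if hashlib.sha256(existing).digest() == hashlib.sha256(content).digest():
            return False
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "wb") as f:
        f.write(content)
    return True
```

Calling this twice with the same bytes writes the file once; the second call returns False and leaves the file's modification time alone.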

Examples

Check all links in a set of Feedmark documents

feedmark --output-links article/*.md | yastasoti --extant-path=article/ - | tee results.json

This will make only HEAD requests to check that the resources exist. It will not fetch them. The ones that could not be fetched will appear in results.json, and you can run yastasoti on that again to re-try:

yastasoti --extant-path=article/ results.json | tee results2.json
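
The check-and-retry cycle above can be sketched roughly as follows. This is an illustration, not yastasoti's code: it uses `urllib` from the standard library instead of `requests` to stay dependency-free, and the names `head_ok` and `check_links` are hypothetical.

```python
import urllib.request


def head_ok(url, timeout=10):
    """Return True if a HEAD request to `url` succeeds with a non-error status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError):
        # OSError covers URLError/HTTPError and socket failures;
        # ValueError covers strings that are not valid URLs.
        return False


def check_links(links, checker=head_ok):
    """Return the subset of `links` whose "url" could not be retrieved.

    Each link is a dict with at least a "url" key. Failed entries are
    returned unchanged, so the result can be used as the next input.
    """
    return [link for link in links if not checker(link["url"])]
```

Because the failures keep the same shape as the input, running `check_links` on its own output is a retry of only the links that failed last time.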

Archive stuff off teh internets

cat >links.json << EOF
[
    {
        "url": "http://catseye.tc/"
    }
]
EOF
yastasoti --archive-to=downloads links.json

Override the filename the stuff is archived as

By default, the subdirectory and filename to which the stuff is archived are based on the site's domain name and the stuff's path. The filename, however, can be overridden if the input JSON contains a dest_filename field.

cat >links.json << EOF
[
    {
        "url": "http://catseye.tc/",
        "dest_filename": "home_page.html"
    }
]
EOF
yastasoti --archive-to=downloads links.json
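
One way the destination could be derived is sketched below. The exact mapping yastasoti uses is not spelled out here, so the one-subdirectory-per-domain layout, the `index.html` fallback, and the function name `destination_for` are all assumptions.

```python
import os
from urllib.parse import urlparse


def destination_for(link, archive_dir):
    """Pick a local path for `link`, a dict with "url" and optional "dest_filename"."""
    parsed = urlparse(link["url"])
    # Assumption: one subdirectory per domain name.
    subdir = os.path.join(archive_dir, parsed.netloc)
    # Assumption: the last path segment is the default filename, with a
    # fallback for URLs whose path ends in "/".
    default_name = os.path.basename(parsed.path) or "index.html"
    # An explicit dest_filename in the input overrides the default.
    filename = link.get("dest_filename", default_name)
    return os.path.join(subdir, filename)
```

Only the filename is overridden in this sketch; the subdirectory still comes from the URL, which matches the description above.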

Categorize archived materials with a router

An archive router (used with --archive-via) is a JSON file that looks like this:

{
    "http://catseye.tc/*": "/dev/null",
    "https://footu.be/*": "footube/",
    "*": "archive/"
}

If a URL matches more than one pattern, the longest pattern will be selected. If the destination is /dev/null it will be treated specially: the file will not be retrieved at all. If no pattern matches, an error will be raised.

To use an archive router once it has been written:

yastasoti --archive-via=router.json links.json
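
The router's selection rule (longest matching pattern wins, /dev/null means skip, no match is an error) can be sketched as below. Treating patterns as shell-style globs via `fnmatch` is an assumption, as is the function name `route`; yastasoti's own matching may differ in detail.

```python
from fnmatch import fnmatchcase


def route(url, router):
    """Return the destination directory for `url`, or None if it should be skipped.

    `router` is a dict mapping patterns to destinations, as loaded from
    the router JSON file.
    """
    # Assumption: patterns are shell-style globs, where "*" matches anything.
    matches = [pattern for pattern in router if fnmatchcase(url, pattern)]
    if not matches:
        raise ValueError("no router pattern matches {}".format(url))
    # The longest matching pattern is taken to be the most specific one.
    best = max(matches, key=len)
    destination = router[best]
    # /dev/null is special: the resource is not retrieved at all.
    return None if destination == "/dev/null" else destination
```

With the example router above, a catseye.tc URL matches both its own pattern and "*"; the longer pattern wins, so the URL is skipped rather than sent to archive/.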

Requirements

Tested under Python 2.7.12. Seems to work under Python 3.5.2 as well, but this is not so official.

Requires the requests Python library to make network requests. Tested with requests version 2.21.0.

If the tqdm Python library is installed, will display a nice progress bar.
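
An optional dependency like this is commonly handled with a guarded import. The following is a generic sketch of that pattern, not yastasoti's actual code; the `process` function is hypothetical.

```python
try:
    from tqdm import tqdm
except ImportError:
    # Fallback with the same call shape: iterate without a progress bar.
    def tqdm(iterable, **kwargs):
        return iterable


def process(links):
    """Visit each link, showing a progress bar only when tqdm is available."""
    visited = []
    for link in tqdm(links):
        visited.append(link["url"])
    return visited
```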

(Or, if you would like to use Docker, you can pull a Docker image from catseye/yastasoti on Docker Hub, following the instructions given on that page.)

TODO

Archive youtube links with youtube-dl.
Handle failures (redirects, etc.) better; in particular, detect 503 / "connection refused" better.
Allow use of an external tool like wget or curl to do fetching.

About

catseye.tc/node/yastasoti
