Make a ZIM file from any Web site and surf offline!

openzim/zimit

Zimit is a scraper that creates a ZIM file from any website.

Zimit adheres to openZIM's Contribution Guidelines.

Zimit has implemented openZIM's Python bootstrap, conventions and policies v1.0.1.

Capabilities and known limitations

While we would like to support as many websites as possible, making an offline archive of any website with a versatile tool obviously has some limitations.

Most capabilities and known limitations are documented in the warc2zim README. There are also some limitations in Browsertrix Crawler (used to fetch the website) and wombat (used to properly replay dynamic web requests), but these are not (yet?) clearly documented.

Technical background

Zimit runs a fully automated browser-based crawl of a website property and produces a ZIM of the crawled content. Zimit runs in a Docker container.

The system:

zimit.py is the entrypoint for the system.

After the crawl is done, warc2zim writes a ZIM to the /output directory, which should be mounted as a volume so that the ZIM is not lost when the container stops.

With the --keep flag, the crawled WARCs and a few other artifacts are also kept in a temporary directory inside /output.

Usage

zimit is intended to be run in Docker. The Docker image is published at https://github.com/orgs/openzim/packages/container/package/zimit.

The image accepts the following parameters, as well as any of the Browsertrix Crawler and warc2zim ones:

  • Required: --seeds URL - the URL to start crawling from; multiple URLs can be separated by a comma (usually not needed, since these are just the seeds of the crawl); the first seed URL is used as the ZIM homepage
  • Required: --name - Name of ZIM file
  • --output - output directory (defaults to /output)
  • --pageLimit U - Limit capture to at most U URLs
  • --scopeExcludeRx <regex> - skip URLs that match the regex from crawling. Can be specified multiple times. An example is --scopeExcludeRx="(\?q=|signup-landing\?|\?cid=)", where URLs that contain either ?q= or signup-landing? or ?cid= will be excluded.
  • --workers N - number of crawl workers to be run in parallel
  • --waitUntil - Puppeteer setting for how long to wait for page load. See the page.goto waitUntil options. The default is load, but for static sites, --waitUntil domcontentloaded may be used to speed up the crawl (for example, to avoid waiting for ads to load).
  • --keep - keep WARC files and other temporary files (stored in a subfolder of the output directory) even when the run succeeds. Without this flag, they are kept only in case of failure and are deleted automatically otherwise.
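Before starting a long crawl, it can be useful to check an exclusion pattern against a few sample URLs. The sketch below does this with grep -E, using the regex from the --scopeExcludeRx example above. It is only an approximation: the crawler may interpret more advanced regex syntax differently from grep, and the example.com URLs and the classify_url helper are made up for illustration.

```shell
#!/bin/sh
# Pattern taken from the --scopeExcludeRx example above.
exclude_rx='(\?q=|signup-landing\?|\?cid=)'

# Hypothetical helper: report whether a URL matches the exclusion pattern.
classify_url() {
  if printf '%s\n' "$1" | grep -Eq "$exclude_rx"; then
    echo "skip $1"
  else
    echo "crawl $1"
  fi
}

# Hypothetical sample URLs, for illustration only.
for url in \
  "https://example.com/search?q=zim" \
  "https://example.com/signup-landing?ref=home" \
  "https://example.com/article?cid=42" \
  "https://example.com/about"
do
  classify_url "$url"
done
```

URLs reported as "skip" contain one of the three excluded fragments; anything reported as "crawl" would remain in scope as far as this pattern is concerned.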

Example command:

docker run ghcr.io/openzim/zimit zimit --help
docker run ghcr.io/openzim/zimit warc2zim --help
docker run -v /output:/output ghcr.io/openzim/zimit zimit --seeds URL --name myzimfile
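Several of the options listed earlier can be combined in one invocation. The following is a minimal sketch rather than an official recipe: the seed URL, ZIM name, host directory, page limit and worker count are placeholder values, and the DRY_RUN variable is an assumption of this script, not a zimit feature. By default the script only prints the command it would run.

```shell
#!/bin/sh
# Placeholder values, chosen only for illustration; adjust them to your crawl.
SEED_URL="https://example.com/"
ZIM_NAME="example"
HOST_OUTPUT="$PWD/output"   # host directory mounted at /output in the container

# Assemble the command as positional parameters so each argument stays intact.
set -- docker run -v "$HOST_OUTPUT:/output" ghcr.io/openzim/zimit zimit \
  --seeds "$SEED_URL" \
  --name "$ZIM_NAME" \
  --pageLimit 100 \
  --workers 2 \
  --waitUntil domcontentloaded \
  --keep

# DRY_RUN is a convention of this sketch: print by default, run only when DRY_RUN=0.
if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "$*"
else
  "$@"
fi
```

Mounting a host directory at /output means the ZIM file, and with --keep the WARC files as well, remain available after the container exits.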

Note: The image automatically filters out a large number of ads using the 3 blocklists from anudeepND. If you don't want this filtering, disable the image's entrypoint in your container (docker run --entrypoint="" ghcr.io/openzim/zimit ...).

To rebuild the Docker image locally, run:

docker build -t ghcr.io/openzim/zimit .

FAQ

The Zimit contributors' team maintains a page with most Frequently Asked Questions.

Nota bene

While Zimit 1.x relied on a Service Worker to display the ZIM content, this is no longer the case since Zimit 2.x, which has no special requirements.

It should also be noted that a first version of a generic HTTP scraper was created in 2016 during the Wikimania Esino Lario Hackathon.

That version is now considered outdated and is archived in the 2016 branch.

License

GPLv3 or later, see LICENSE for more details.

