Tool and library for handling Web ARChive (WARC) files.

License

GPL-3.0 license

WARCAT: Web ARChive (WARC) Archiving Tool

Tool and library for handling Web ARChive (WARC) files.

2024-10-11: Please have a look at a new project, warcat-rs. It is written in Rust, performs faster and more correctly, and is designed to work with other programs using JSON. Missing features and bugs mentioned here will also be addressed in the new project soon.

Quick Start

Requirements:

Python 3

Install stable version:

pip-3 install warcat

Or install latest version:

git clone git://github.com/chfoo/warcat.git
pip-3 install -r requirements.txt
python3 setup.py install

Example Run:

python3 -m warcat --help
python3 -m warcat list example/at.warc.gz
python3 -m warcat verify megawarc.warc.gz --progress
python3 -m warcat extract megawarc.warc.gz --output-dir /tmp/megawarc/ --progress

Supported commands

concat: Naively join archives into one
extract: Extract files from archive
help: List commands available
list: List contents of archive
pass: Load archive and write it back out
split: Split archives into individual records
verify: Verify digest and validate conformance
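As an illustration of what a naive join can look like, the sketch below appends the raw bytes of each input file to one output file. This is an assumption about what "naively" means here, not Warcat's own code; `naive_concat` and the sample file names are hypothetical. It relies on the file format being a plain concatenation of records.

```python
import shutil
import tempfile
from pathlib import Path


def naive_concat(sources, destination):
    """Append the raw bytes of each source file to destination, in order."""
    with open(destination, "wb") as out:
        for source in sources:
            with open(source, "rb") as src:
                # Stream in chunks so large archives are never held in memory.
                shutil.copyfileobj(src, out)


# Demonstration with two tiny throwaway files (not real captures).
SAMPLE = b"WARC/1.0\r\nContent-Length: 0\r\n\r\n\r\n\r\n"

with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp, "a.warc")
    second = Path(tmp, "b.warc")
    first.write_bytes(SAMPLE)
    second.write_bytes(SAMPLE)
    joined = Path(tmp, "joined.warc")
    naive_concat([first, second], joined)
    joined_bytes = joined.read_bytes()
```

A join like this does not deduplicate or rewrite records. It only places them one after another.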

Library

Example:

>>> import warcat.model
>>> warc = warcat.model.WARC()
>>> warc.load('example/at.warc.gz')
>>> len(warc.records)
8
>>> record = warc.records[0]
>>> record.warc_type
'warcinfo'
>>> record.content_length
233
>>> record.header.version
'1.0'
>>> record.header.fields.list()
[('WARC-Type', 'warcinfo'), ('Content-Type', 'application/warc-fields'), ('WARC-Date', '2013-04-09T00:11:14Z'), ('WARC-Record-ID', '<urn:uuid:972777d2-4177-4c63-9fde-3877dacc174e>'), ('WARC-Filename', 'at.warc.gz'), ('WARC-Block-Digest', 'sha1:3C6SPSGP5QN2HNHKPTLYDHDPFYKYAOIX'), ('Content-Length', '233')]
>>> record.header.fields['content-type']
'application/warc-fields'
>>> record.content_block.fields.list()
[('software', 'Wget/1.13.4-2608 (linux-gnu)'), ('format', 'WARC File Format 1.0'), ('conformsTo', 'http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf'), ('robots', 'classic'), ('wget-arguments', '"http://www.archiveteam.org/" "--warc-file=at" ')]
>>> record.content_block.fields['software']
'Wget/1.13.4-2608 (linux-gnu)'
>>> record.content_block.payload.length
0
>>> bytes(warc)[:60]
b'WARC/1.0\r\nWARC-Type: warcinfo\r\nContent-Type: application/war'
>>> bytes(record.content_block.fields)[:60]
b'software: Wget/1.13.4-2608 (linux-gnu)\r\nformat: WARC File Fo'
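The WARC-Block-Digest value in the session above has the shape of a Base32-encoded SHA-1 hash. The following sketch builds a label in that format using only the standard library. It does not use Warcat. `block_digest` is a hypothetical helper, and it is an assumption here that the digest is taken over the raw bytes of the record's content block.

```python
import base64
import hashlib


def block_digest(block: bytes) -> str:
    """Return a 'sha1:<base32>' label for the given bytes."""
    sha1 = hashlib.sha1(block).digest()  # 20 raw bytes
    # 20 bytes encode to exactly 32 Base32 characters, with no padding.
    return "sha1:" + base64.b32encode(sha1).decode("ascii")


# Hypothetical block contents, for illustration only.
label = block_digest(b"software: example\r\n")
```

Comparing such a label with the value stored in the header is one way a digest check could be done.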

Note

The library may not be entirely thread-safe yet.

About

The goal of the Warcat project is to create a tool and library as easy and fast as manipulating any other archive such as tar and zip archives.

Warcat is designed to handle large, gzip-ed files by partially extracting them as needed.

Warcat is provided without warranty and cannot guarantee the safety of your files. Remember to make backups and test them!

Homepage: https://github.com/chfoo/warcat
Documentation: http://warcat.readthedocs.org/
Questions?: https://answers.launchpad.net/warcat
Bugs?: https://github.com/chfoo/warcat/issues
PyPI: https://pypi.python.org/pypi/Warcat/
Chat: irc://irc.efnet.org/archiveteam-bs (I'll be on #archiveteam-bs on EFnet)

Specification

This implementation is based loosely on draft ISO 28500 papers WARC_ISO_28500_version1_latestdraft.pdf and warc_ISO_DIS_28500.pdf, which can be found at http://bibnum.bnf.fr/WARC/ .

File format

Here's a quick description:

A WARC file contains one or more Records concatenated together. Each Record contains Named Fields, newline, a Content Block, newline, and newline. A Content Block may be one of two types: {binary data} or {Named Fields, newline, and binary data}. Named Fields consist of string, colon, string, and newline.
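The layout described above can be sketched as a small parser for a single uncompressed record. This is a minimal illustration, not Warcat's parser: `parse_record` is a hypothetical name, the sample record is invented, the first header line is assumed to be the version line (as in the `WARC/1.0` output of the Library example), and folded (multi-line) field values are not handled.

```python
def parse_record(data: bytes):
    """Split one uncompressed record into (version, fields, block)."""
    # The header ends at the first blank line (CRLF CRLF).
    header, _, rest = data.partition(b"\r\n\r\n")
    lines = header.decode("utf-8").split("\r\n")
    version = lines[0]  # e.g. "WARC/1.0"
    fields = []
    for line in lines[1:]:
        # Each named field is "name: value".
        name, _, value = line.partition(":")
        fields.append((name.strip(), value.strip()))
    # Field names are looked up case-insensitively.
    lookup = {name.lower(): value for name, value in fields}
    length = int(lookup["content-length"])
    block = rest[:length]  # the two trailing newlines are left out
    return version, fields, block


# Invented sample record, for illustration only.
sample_block = b"software: example\r\n"
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: warcinfo\r\n"
    b"Content-Length: " + str(len(sample_block)).encode("ascii") + b"\r\n"
    b"\r\n" + sample_block + b"\r\n\r\n"
)
version, fields, parsed_block = parse_record(sample)
```

Using Content-Length to find the end of the block, rather than searching for the trailing newlines, keeps the parser correct when the block itself contains blank lines.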

A Record may be compressed with gzip. Filenames ending with .warc.gz indicate one or more gzip compressed files concatenated together.
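Because each compressed record is its own gzip member, the members can be decompressed one at a time. The sketch below shows one way to do that with the standard `zlib` module. It is not Warcat's implementation; `iter_gzip_members` and the sample data are hypothetical, and the whole input is held in memory for simplicity.

```python
import gzip
import zlib


def iter_gzip_members(data: bytes):
    """Yield the decompressed bytes of each gzip member in data."""
    while data:
        # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
        yield decompressor.decompress(data) + decompressor.flush()
        # Whatever follows the finished member belongs to the next one.
        data = decompressor.unused_data


# Invented sample: two members concatenated, as in a .warc.gz file.
records = [b"record one", b"record two"]
archive = b"".join(gzip.compress(record) for record in records)
members = list(iter_gzip_members(archive))
```

Splitting at member boundaries means a single record can be read without decompressing the ones after it.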

Alternatives

Warcat is inspired by

Development

Testing

Always remember to test. Continue testing:

python3 -m unittest discover -p '*_test.py'
nosetests3

To-do

Smart archive join
Regex filtering of records
Generate index to disk (eg, for fast resume)
Grab files like wget and archive them
See TODO and FIXME markers in code
etc.
