Tool and library for handling Web ARChive (WARC) files.
2024-10-11: Please have a look at a new project, warcat-rs. It is written in Rust, performs faster and more correctly, and is designed to work with other programs using JSON. Missing features and bugs mentioned here will also be addressed in the new project soon.
Requirements:
- Python 3
Install stable version:
pip-3 install warcat
Or install latest version:
git clone git://github.com/chfoo/warcat.git
pip-3 install -r requirements.txt
python3 setup.py install
Example Run:
python3 -m warcat --help
python3 -m warcat list example/at.warc.gz
python3 -m warcat verify megawarc.warc.gz --progress
python3 -m warcat extract megawarc.warc.gz --output-dir /tmp/megawarc/ --progress
- concat: Naively join archives into one
- extract: Extract files from archive
- help: List commands available
- list: List contents of archive
- pass: Load archive and write it back out
- split: Split archives into individual records
- verify: Verify digest and validate conformance
Example:
>>> import warcat.model
>>> warc = warcat.model.WARC()
>>> warc.load('example/at.warc.gz')
>>> len(warc.records)
8
>>> record = warc.records[0]
>>> record.warc_type
'warcinfo'
>>> record.content_length
233
>>> record.header.version
'1.0'
>>> record.header.fields.list()
[('WARC-Type', 'warcinfo'), ('Content-Type', 'application/warc-fields'), ('WARC-Date', '2013-04-09T00:11:14Z'), ('WARC-Record-ID', '<urn:uuid:972777d2-4177-4c63-9fde-3877dacc174e>'), ('WARC-Filename', 'at.warc.gz'), ('WARC-Block-Digest', 'sha1:3C6SPSGP5QN2HNHKPTLYDHDPFYKYAOIX'), ('Content-Length', '233')]
>>> record.header.fields['content-type']
'application/warc-fields'
>>> record.content_block.fields.list()
[('software', 'Wget/1.13.4-2608 (linux-gnu)'), ('format', 'WARC File Format 1.0'), ('conformsTo', 'http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf'), ('robots', 'classic'), ('wget-arguments', '"http://www.archiveteam.org/" "--warc-file=at" ')]
>>> record.content_block.fields['software']
'Wget/1.13.4-2608 (linux-gnu)'
>>> record.content_block.payload.length
0
>>> bytes(warc)[:60]
b'WARC/1.0\r\nWARC-Type: warcinfo\r\nContent-Type: application/war'
>>> bytes(record.content_block.fields)[:60]
b'software: Wget/1.13.4-2608 (linux-gnu)\r\nformat: WARC File Fo'
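Building on the session above, here is a minimal sketch that tallies the records in an archive by type. It relies only on the `warc.records` list and the `record.warc_type` attribute shown in the example. The helper names `count_record_types` and `load_and_count` are hypothetical and not part of Warcat.

```python
from collections import Counter


def count_record_types(records):
    """Tally records by their ``warc_type`` attribute (hypothetical helper)."""
    return Counter(record.warc_type for record in records)


def load_and_count(path):
    """Load a WARC file with warcat and tally its record types."""
    # Imported lazily so count_record_types works without warcat installed.
    import warcat.model

    warc = warcat.model.WARC()
    warc.load(path)
    return count_record_types(warc.records)
```

Calling `load_and_count('example/at.warc.gz')` should return a `Counter` keyed by type names such as `'warcinfo'`. The exact counts depend on the archive.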
Note
The library may not be entirely thread-safe yet.
The goal of the Warcat project is to make working with WARC files as easy and fast as working with any other archive format, such as tar and zip.
Warcat is designed to handle large, gzip-ed files by partially extracting them as needed.
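To illustrate the general idea of partial extraction, the sketch below reads only the header of the first record from a gzip-ed WARC file, using Python's standard `gzip` module. It is an assumption-laden illustration and not how Warcat itself is implemented. The function name `read_first_header` is hypothetical.

```python
import gzip


def read_first_header(path):
    """Return the header lines of the first record in a gzip-ed WARC file.

    The stream is decompressed incrementally, so reading stops well before
    the end of a large archive. (Hypothetical helper, standard library only.)
    """
    lines = []
    with gzip.open(path, "rb") as stream:
        for raw_line in stream:
            line = raw_line.rstrip(b"\r\n")
            if not line:
                break  # a blank line ends the record header
            lines.append(line.decode("utf-8", errors="replace"))
    return lines
```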
Warcat is provided without warranty and cannot guarantee the safety of your files. Remember to make backups and test them!
- Homepage: https://github.com/chfoo/warcat
- Documentation: http://warcat.readthedocs.org/
- Questions?: https://answers.launchpad.net/warcat
- Bugs?: https://github.com/chfoo/warcat/issues
- PyPI: https://pypi.python.org/pypi/Warcat/
- Chat: irc://irc.efnet.org/archiveteam-bs (I'll be on #archiveteam-bs on EFnet)
This implementation is based loosely on the draft ISO 28500 papers WARC_ISO_28500_version1_latestdraft.pdf and warc_ISO_DIS_28500.pdf, which can be found at http://bibnum.bnf.fr/WARC/ .
Here's a quick description:
A WARC file contains one or more Records concatenated together. Each Record contains Named Fields, newline, a Content Block, newline, and newline. A Content Block may be one of two types: {binary data} or {Named Fields, newline, and binary data}. Named Fields consist of string, colon, string, and newline.
A Record may be compressed with gzip. Filenames ending with .warc.gz indicate one or more gzip compressed files concatenated together.
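The layout described above can be sketched with the standard library alone. This is a simplified illustration, not Warcat's parser. It assumes each record starts with a version line such as `WARC/1.0` (as the `bytes(warc)` output in the library example suggests) and that a `Content-Length` field gives the block size. The names `parse_named_fields`, `parse_records`, and `make_record` are hypothetical.

```python
import gzip
import io


def parse_named_fields(stream):
    """Read 'name: value' lines until a blank line or end of data."""
    fields = []
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            return fields
        # Split at the first colon only, so values may contain colons.
        name, _, value = line.decode("utf-8").partition(":")
        fields.append((name.strip(), value.strip()))


def parse_records(data):
    """Yield (version, fields, block) for each record in uncompressed bytes."""
    stream = io.BytesIO(data)
    while True:
        version = stream.readline()
        if not version:
            return  # end of data
        if not version.strip():
            continue  # skip the two newlines that close each record
        fields = parse_named_fields(stream)
        lookup = {name.lower(): value for name, value in fields}
        block = stream.read(int(lookup["content-length"]))
        yield version.strip().decode("ascii"), fields, block


def make_record(warc_type, block):
    """Build a tiny record for demonstration purposes only."""
    header = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"Content-Length: {len(block)}\r\n"
        "\r\n"
    ).encode("ascii")
    return header + block + b"\r\n\r\n"


# Each record is gzip-ed on its own, then the gzip members are concatenated.
archive = (
    gzip.compress(make_record("warcinfo", b"software: example\r\n"))
    + gzip.compress(make_record("resource", b"hello"))
)

# gzip.decompress reads through all concatenated members.
records = list(parse_records(gzip.decompress(archive)))
```

The first record's block is itself Named Fields, so `parse_named_fields(io.BytesIO(records[0][2]))` can be used to read it, which mirrors the second Content Block type described above.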
Warcat is inspired by
Always remember to test. Continue testing:
python3 -m unittest discover -p '*_test.py'
nosetests3
- Smart archive join
- Regex filtering of records
- Generate index to disk (eg, for fast resume)
- Grab files like wget and archive them
- See TODO and FIXME markers in code
- etc.