warc
Here are 110 public repositories matching this topic...
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
- Updated Nov 15, 2025 - Python
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
- Updated Nov 26, 2025 - Java
Collect and revisit web pages.
- Updated Jan 11, 2025 - Python
A High-Fidelity Web Archiving Extension for Chrome and Chromium-based browsers!
- Updated Nov 4, 2025 - TypeScript
Run a high-fidelity browser-based web archiving crawler in a single Docker container
- Updated Nov 28, 2025 - TypeScript
Serverless replay of web archives directly in the browser
- Updated Nov 29, 2025 - TypeScript
InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
- Updated Oct 10, 2025 - Python
Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)
- Updated Sep 17, 2020 - JavaScript
Streaming WARC/ARC library for fast web archive IO
- Updated Dec 10, 2024 - Python
🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation
- Updated Mar 12, 2025 - Roff
News crawling with StormCrawler - stores content as WARC
- Updated Feb 19, 2025 - Java
Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
- Updated Nov 28, 2025 - TypeScript
Bitextor generates translation memories from multilingual websites
- Updated Nov 11, 2024 - Python
WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.
- Updated Feb 11, 2025 - Python
Chrome extension to "Create WARC files from any webpage"
- Updated Dec 6, 2023 - JavaScript
CoCrawler is a versatile web crawler built using modern tools and concurrency.
- Updated Apr 29, 2022 - Python
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
- Updated Nov 21, 2025 - Python
An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at Internet Archive.
- Updated Oct 8, 2025 - Scala