Scrape posts, threads from forums, news aggregators, mail archives, export to JSONL, mailbox, WARC
mikwielgus/forum-dl
Forum-dl is a scraper and archiver for forums (including Discourse, PhpBB, SMF), mailing lists, and news aggregators (see the list of supported software below). It can extract and archive all posts from individual threads and entire boards into JSONL, Mbox, Maildir, WARC, and other formats (see the complete list of output formats below).
The project is currently in alpha stage. Please do not hesitate to file bug reports and feature requests.
You can install the stable release of Forum-dl with pip or directly from the repository. The minimum Python version is 3.10.11.
Install the latest stable version with pip:
pip install forum-dl
Clone the repository and install the development branch in editable mode:
git clone https://github.com/mikwielgus/forum-dl && pip install -e forum-dl
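Before installing, it can help to confirm that the interpreter meets the minimum version stated above. The following is a minimal sketch, not part of Forum-dl; the names `MINIMUM` and `meets_minimum` are hypothetical and chosen only for illustration.

```python
import sys

# Minimum version taken from the installation notes above.
MINIMUM = (3, 10, 11)


def meets_minimum(version=None, minimum=MINIMUM):
    """Return True if `version` (major, minor, micro) is at least `minimum`.

    When `version` is omitted, the running interpreter is checked.
    """
    if version is None:
        version = sys.version_info[:3]
    return tuple(version) >= tuple(minimum)
```

Calling `meets_minimum()` with no arguments checks the interpreter that runs the script, so it should be run with the same Python that pip will install into.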
Download a Simple Machines Forum thread in JSONL format:
forum-dl "https://www.simplemachines.org/community/index.php?topic=584230.0"
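JSONL output holds one JSON object per line, so it can be inspected with a few lines of Python. The sketch below is an illustration only: this README does not describe the fields Forum-dl writes, so the code makes no assumption about them and simply counts which top-level keys appear. The helper names `read_jsonl` and `count_keys` are hypothetical.

```python
import json
from collections import Counter


def read_jsonl(lines):
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_keys(records):
    """Count how often each top-level key appears across the records."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts
```

For example, after saving the thread with `-o thread.jsonl` (a file name chosen here for illustration), `count_keys(read_jsonl(open("thread.jsonl", encoding="utf-8")))` gives a quick overview of the record structure.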
Save all images from the same thread in the directory files:
forum-dl --files-output files "https://www.simplemachines.org/community/index.php?topic=584230.0"
Download a PhpBB subboard in JSONL format, write it to stdout (-o -) and record a WARC file in phpbb.warc:
forum-dl --warc-output phpbb.warc "https://www.phpbb.com/community/viewforum.php?f=696"
(Due to current architectural limitations, forum-dl will scan the first page of each board in the entire forum before downloading the target board. This will be fixed in future releases.)
Download Hacker News top stories and write them to a Maildir directory hn:
forum-dl --textify --content-as-title -f maildir -o hn "https://news.ycombinator.com/news"
--textify converts HTML to plaintext (useful for text-only mail clients), --content-as-title puts the beginning of each message's content in its title (useful for mail clients that don't display content in index view), -f maildir changes the output format to maildir, and -o hn changes the output directory name to hn.
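A Maildir directory written this way can be read back with Python's standard `mailbox` module. The following is a minimal sketch under two assumptions: that the directory is a regular Maildir, and that each message's title ends up in the Subject header (this README does not state the header mapping). The function name `list_subjects` is hypothetical.

```python
import mailbox


def list_subjects(maildir_path):
    """Return the Subject header of every message in a Maildir directory.

    Assumes titles are stored in the Subject header; messages without
    one are reported as an empty string.
    """
    # create=False raises mailbox.NoSuchMailboxError if the directory
    # does not exist, instead of silently creating an empty Maildir.
    box = mailbox.Maildir(maildir_path, create=False)
    try:
        return [message.get("Subject", "") for message in box]
    finally:
        box.close()
```

With the command above, `list_subjects("hn")` would list the stored titles, which makes it easy to check the effect of --content-as-title.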
Supported forum, mailing list, and news aggregator software:
- Discourse
- Hacker News
- Hyperkitty
- Hypermail
- Invision Power Board
- PhpBB
- Pipermail
- Proboards
- Simple Machines Forum
- vBulletin
- Xenforo
Supported output formats:
- Babyl
- JSONL
- Maildir
- Mbox
- MH
- MMDF
- WARC
forum-dl [--help] [--version] [--list-extractors] [--list-output-formats] [--timeout SECONDS] [-R N] [--retry-sleep SECONDS] [--retry-sleep-multiplier K] [--user-agent UA] [-q] [-v] [-g] [-o OUTFILE] [-f FORMAT] [--warc-output FILE] [--files-output DIR] [--boards | --no-boards] [--threads | --no-threads] [--posts | --no-posts] [--files | --no-files] [--outside-files | --no-outside-files] [--textify] [--content-as-title] [--author-as-addr-spec]
- --help: Show this help message and exit
- --version: Print program version and exit
- --list-extractors: List all supported extractors and exit
- --list-output-formats: List all supported output formats and exit
- --timeout SECONDS: HTTP connection timeout
- -R N, --retries N: Maximum number of retries for failed HTTP requests, or -1 to retry infinitely (default: 4)
- --retry-sleep SECONDS: Time to sleep between retries, in seconds (default: 1)
- --retry-sleep-multiplier K: A constant by which sleep time is multiplied on each retry (default: 2)
- --user-agent UA: User-Agent request header
- -q, --quiet: Activate quiet mode
- -v, --verbose: Print various debugging information
- -g, --get-urls: Print URLs instead of downloading
- -o OUTFILE, --output OUTFILE: Output all results concatenated to OUTFILE, or stdout if OUTFILE is - (default: -)
- -f FORMAT, --output-format FORMAT: Output format. Use --list-output-formats for a list of possible arguments
- --warc-output FILE: Record HTTP requests, store them in FILE in WARC format
- --files-output DIR: Store files in DIR instead of OUTFILE
- --boards, --no-boards: Write board objects (default: True, --no-boards to negate)
- --threads, --no-threads: Write thread objects (default: True, --no-threads to negate)
- --posts, --no-posts: Write post objects (default: True, --no-posts to negate)
- --files, --no-files: Write embedded files (--no-files to negate)
- --outside-files, --no-outside-files: Write embedded files outside post content. Auto-enabled by --warc-output and -f warc (default: False, --no-outside-files to negate)
- --textify: Lossily convert HTML content to plaintext
- --content-as-title: Write 98 initial characters of content in title field of each post
- --author-as-addr-spec: Append author and domain as an addr-spec in the From header
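The three retry options combine into an exponential backoff schedule. The sketch below is one reading of the help text, not Forum-dl's actual code: it assumes the first retry waits --retry-sleep seconds and each later wait is the previous one multiplied by --retry-sleep-multiplier. The function name `retry_delays` is hypothetical.

```python
from itertools import count


def retry_delays(retries=4, sleep=1.0, multiplier=2.0):
    """Yield the assumed wait, in seconds, before each retry.

    Defaults mirror the documented ones (4 retries, 1 second, factor 2).
    retries=-1 yields delays forever, mirroring "retry infinitely".
    """
    attempts = count() if retries == -1 else range(retries)
    for attempt in attempts:
        yield sleep * multiplier ** attempt
```

Under these assumptions the defaults give waits of 1, 2, 4, and 8 seconds, so a request that keeps failing is abandoned after roughly 15 seconds of sleeping plus the time spent on the requests themselves.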