ArchiveBot, an IRC bot for archiving websites


1. ArchiveBot

    <SketchCow> Coders, I have a question.
    <SketchCow> Or, a request, etc.
    <SketchCow> I spent some time with xmc discussing something we could
                do to make things easier around here.
    <SketchCow> What we came up with is a trigger for a bot, which can
                be triggered by people with ops.
    <SketchCow> You tell it a website. It crawls it. WARC. Uploads it to
                archive.org. Boom.
    <SketchCow> I can supply machine as needed.
    <SketchCow> Obviously there's some sanitation issues, and it is root
                all the way down or nothing.
    <SketchCow> I think that would help a lot for smaller sites
    <SketchCow> Sites where it's 100 pages or 1000 pages even, pretty
                simple.
    <SketchCow> And just being able to go "bot, get a sanity dump"

2. More info

ArchiveBot has two major backend components: the control node, which
runs the IRC interface and bookkeeping programs, and the crawlers,
which do all the Web crawling.  ArchiveBot users communicate with
ArchiveBot by issuing commands in an IRC channel (an illustrative
command sketch appears at the end of this file).

User's guide: http://archivebot.readthedocs.org/en/latest/

Control node installation guide: INSTALL.backend

Crawler installation guide: INSTALL.pipeline

3. Local use

ArchiveBot was originally written as a set of separate programs for
deployment on a server.  This means it has a poor distribution story.

However, Ivan Kozik (@ivan) has taken the ArchiveBot pipeline,
dashboard, ignores, and control system and created a package intended
for personal use.  You can find it at
https://github.com/ArchiveTeam/grab-site.

4. License

Copyright 2013 David Yip; made available under the MIT license.  See
LICENSE for details.

5. Acknowledgments

Thanks to Alard (@alard), who added WARC generation and Lua scripting
to GNU Wget.  Wget+lua was the first web crawler used by ArchiveBot.

Thanks to Christopher Foo (@chfoo) for wpull, ArchiveBot's current web
crawler.

Thanks to Ivan Kozik (@ivan) for maintaining ignore patterns and
tracking down performance problems at scale.

Other thanks go to the following projects:

* Celluloid <http://celluloid.io/>
* Cinch <https://github.com/cinchrb/cinch/>
* CouchDB <http://couchdb.apache.org/>
* Ember.js <http://emberjs.com/>
* Redis <http://redis.io/>
* Seesaw <https://github.com/ArchiveTeam/seesaw-kit>

6. Special thanks

Dragonette, Barnaby Bright, Vienna Teng, NONONO.

The memory hole of the Web has gone too far.

Don't look down, never look away; ArchiveBot's like the wind.

vim:ts=2:sw=2:tw=72:et
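
To make the IRC workflow described in section 2 a little more
concrete, here is a minimal sketch of what issuing a job in the
control channel might look like.  The command names are recalled from
the user's guide linked above and may have changed; the URLs are
placeholders, and the bot's replies (job identifier, dashboard link,
and so on) are omitted.  Treat this as an illustration, not a
reference -- the user's guide is authoritative.

    <operator> !archive http://example.com/
               (recursively crawl an entire site)
    <operator> !archiveonly http://example.com/some-page.html
               (fetch only the given page)
    <operator> !abort <job id>
               (stop a running job)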
