ArchiveBot, an IRC bot for archiving websites


1. ArchiveBot

    <SketchCow> Coders, I have a question.
    <SketchCow> Or, a request, etc.
    <SketchCow> I spent some time with xmc discussing something we could
                do to make things easier around here.
    <SketchCow> What we came up with is a trigger for a bot, which can
                be triggered by people with ops.
    <SketchCow> You tell it a website. It crawls it. WARC. Uploads it to
                archive.org. Boom.
    <SketchCow> I can supply machine as needed.
    <SketchCow> Obviously there's some sanitation issues, and it is root
                all the way down or nothing.
    <SketchCow> I think that would help a lot for smaller sites
    <SketchCow> Sites where it's 100 pages or 1000 pages even, pretty
                simple.
    <SketchCow> And just being able to go "bot, get a sanity dump"

2. More info

ArchiveBot has two major backend components: the control node, which
runs the IRC interface and bookkeeping programs, and the crawlers,
which do all the Web crawling.  ArchiveBot users communicate with
ArchiveBot by issuing commands in an IRC channel (an illustrative
command sketch appears at the end of this file).

User's guide: http://archivebot.readthedocs.org/en/latest/

Control node installation guide: INSTALL.backend

Crawler installation guide: INSTALL.pipeline

3. Local use

ArchiveBot was originally written as a set of separate programs for
deployment on a server.  This means it has a poor distribution story.

However, Ivan Kozik (@ivan) has taken the ArchiveBot pipeline,
dashboard, ignores, and control system and created a package intended
for personal use.  You can find it at
https://github.com/ArchiveTeam/grab-site.

4. License

Copyright 2013 David Yip; made available under the MIT license.  See
LICENSE for details.

5. Acknowledgments

Thanks to Alard (@alard), who added WARC generation and Lua scripting
to GNU Wget.  Wget+lua was the first web crawler used by ArchiveBot.

Thanks to Christopher Foo (@chfoo) for wpull, ArchiveBot's current web
crawler.

Thanks to Ivan Kozik (@ivan) for maintaining ignore patterns and
tracking down performance problems at scale.

Other thanks go to the following projects:

* Celluloid <http://celluloid.io/>
* Cinch <https://github.com/cinchrb/cinch/>
* CouchDB <http://couchdb.apache.org/>
* Ember.js <http://emberjs.com/>
* Redis <http://redis.io/>
* Seesaw <https://github.com/ArchiveTeam/seesaw-kit>

6. Special thanks

Dragonette, Barnaby Bright, Vienna Teng, NONONO.

The memory hole of the Web has gone too far.

Don't look down, never look away; ArchiveBot's like the wind.

vim:ts=2:sw=2:tw=72:et
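
To make the IRC workflow described in section 2 a little more
concrete, here is a minimal sketch of what issuing a job in the
control channel might look like.  The command names are recalled from
the user's guide linked above and may have changed; the URLs are
placeholders, and the bot's replies (job identifier, dashboard link,
and so on) are omitted.  Treat this as an illustration, not a
reference -- the user's guide is authoritative.

    <operator> !archive http://example.com/
               (recursively crawl an entire site)
    <operator> !archiveonly http://example.com/some-page.html
               (fetch only the given page)
    <operator> !abort <job id>
               (stop a running job)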
