fake-name/ReadableWebProxy

Rewriting web proxy and archival tool. At this point, it just tries to download all the things.
Reading long-form content on the internet is a shitty experience.
This is a web-proxy that tries to make it better.
This is a rewriting proxy. In other words, it proxies arbitrary web content, while allowing the rewriting of the remote content as driven by a set of rule-files. The goal is to effectively allow the complete customization of any existing web-sites as driven by predefined rules.
Functionally, it's used for extracting just the actual content body of a site and reproducing it in a clean layout. It also modifies all links on the page to point to internal addresses, so following a link points to the proxied version of the file, rather than the original.
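The link-rewriting step can be sketched roughly like this (a minimal illustration only, not the project's actual rule engine; the `/web/` prefix and the regex-based approach are assumptions for the example):

```python
import re
from urllib.parse import quote

def rewrite_links(html, proxy_prefix="/web/"):
    # Point every href/src at an internal proxy address, so that
    # following a link loads the proxied copy rather than the original.
    def repl(match):
        attr, url = match.group(1), match.group(2)
        return '%s="%s%s"' % (attr, proxy_prefix, quote(url, safe=""))
    return re.sub(r'(href|src)="([^"]+)"', repl, html)

page = '<a href="http://example.com/ch1">Chapter 1</a>'
print(rewrite_links(page))
# → <a href="/web/http%3A%2F%2Fexample.com%2Fch1">Chapter 1</a>
```

Percent-encoding the whole original URL into the path keeps it round-trippable: the proxy can decode it to fetch (or look up) the upstream page.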
While the above was the original scope, the project has mutated heavily. At this point, it has a complete web spider and archives entire websites to local storage. Additionally, multiple versions of each page are kept, with an overall rolling refresh of the entire database at configurable intervals (configurable on a per-domain or global basis).
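The rolling-refresh idea can be sketched as follows (the interval values and config shape here are hypothetical, not the project's actual settings):

```python
import datetime

# Hypothetical per-domain refresh intervals in days, with a
# global default used for any domain not listed explicitly.
REFRESH_INTERVAL_DAYS = {
    "example.com": 7,
    "__default__": 30,
}

def next_refresh(domain, last_fetched):
    # A page is due for re-fetch once its domain's interval has elapsed.
    days = REFRESH_INTERVAL_DAYS.get(domain, REFRESH_INTERVAL_DAYS["__default__"])
    return last_fetched + datetime.timedelta(days=days)

last = datetime.datetime(2024, 1, 1)
print(next_refresh("example.com", last))  # → 2024-01-08 00:00:00
print(next_refresh("other.net", last))    # → 2024-01-31 00:00:00
```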
There are also a lot of facilities responsible for feeding the releases/RSS views as part of wlnupdates.com.
Quick installation overview:
- Install Redis
- (optional) Install InfluxDB
- (optional) Install Graphite
- Install Postgresql >= 10.
- Build the community extensions for Postgresql.
- Create a database for the project.
- In the project database, install the `pg_trgm` and `citext` extensions from the community extensions modules.
- Copy `settings.example.py` to `settings.py`.
- Fill in all settings in `settings.py`.
- Set up the virtualenv by running `build-venv.sh`
- Activate the virtualenv: `source flask/bin/activate`
- Bootstrap the DB: `alembic upgrade head`
- (on another machine/session) Run the local fetch RPC server `run_local.sh` from https://github.com/fake-name/AutoTriever
- Run the server: `python3 run.py`
- If you want to run the spider, it has a LOT more complicated components:
  - The main scraper is started by `python runScrape.py`
  - The raw scraper is started by `python runScrape.py raw`
  - The scraper periodic scheduler is started by `python runScrape.py scheduler`
- The scraper requires substantial RPC infrastructure. You will need:
  - A RabbitMQ instance with a public DNS address
  - A machine running saltstack + salt-master with a public DNS address. On the salt machine, run https://github.com/fake-name/AutoTriever/tree/master/marshaller/salt_scheduler.py
  - A variable number of RPC workers to execute fetch tasks. The AutoTriever project can be used to manage these.
  - A machine to run the RPC local demultiplexing agent (`run_agent.sh`). The RPC agent allows multiple projects to use the RPC system simultaneously. Since the RPC system basically allows executing either predefined jobs or arbitrary code on the worker swarm, this is fairly useful in general, so I've implemented it as a service that multiple of my projects then use.
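The demultiplexing agent's role can be sketched like this (an assumed design for illustration only, not AutoTriever's actual protocol or message format): one shared response stream from the worker swarm is fanned out to per-project queues, keyed by a project tag on each message.

```python
import queue

class DemuxAgent:
    # Toy demultiplexer: each project registers its own queue, and
    # incoming messages are routed by the "project" tag they carry.
    def __init__(self):
        self.project_queues = {}

    def register(self, project):
        q = queue.Queue()
        self.project_queues[project] = q
        return q

    def dispatch(self, message):
        # Each response carries the tag of the project that issued it.
        self.project_queues[message["project"]].put(message)

agent = DemuxAgent()
proxy_q = agent.register("ReadableWebProxy")
wln_q = agent.register("wlnupdates")

agent.dispatch({"project": "ReadableWebProxy", "job": "fetch", "url": "http://example.com"})
print(proxy_q.get_nowait()["url"])  # → http://example.com
```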
Ubuntu dependencies
- postgresql-common libpq-dev libenchant-dev
- probably more I've forgotten