fake-name/ReadableWebProxy

Rewriting web proxy and archival tool. At this point, it just tries to download all the things.
Reading long-form content on the internet is a shitty experience.
This is a web-proxy that tries to make it better.
This is a rewriting proxy. In other words, it proxies arbitrary web content, while allowing the rewriting of the remote content as driven by a set of rule-files. The goal is to effectively allow the complete customization of any existing web-sites as driven by predefined rules.
Functionally, it's used for extracting just the actual content body of a site and reproducing it in a clean layout. It also modifies all links on the page to point to internal addresses, so following a link points to the proxied version of the file, rather than the original.
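The link-rewriting step can be sketched roughly like this (a minimal illustration only, not the project's actual rule engine; the `/web/` prefix and the regex-based approach are assumptions for the example):

```python
import re
from urllib.parse import quote

def rewrite_links(html, proxy_prefix="/web/"):
    # Point every href/src at an internal proxy address, so that
    # following a link loads the proxied copy rather than the original.
    def repl(match):
        attr, url = match.group(1), match.group(2)
        return '%s="%s%s"' % (attr, proxy_prefix, quote(url, safe=""))
    return re.sub(r'(href|src)="([^"]+)"', repl, html)

page = '<a href="http://example.com/ch1">Chapter 1</a>'
print(rewrite_links(page))
# → <a href="/web/http%3A%2F%2Fexample.com%2Fch1">Chapter 1</a>
```

Percent-encoding the whole original URL into the path keeps it round-trippable: the proxy can decode it to fetch (or look up) the upstream page.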
While the above was the original scope, the project has mutated heavily. At this point, it has a complete web spider and archives entire websites to local storage. Additionally, multiple versions of each page are kept, with an overall rolling refresh of the entire database at configurable intervals (configurable on a per-domain or global basis).
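The rolling-refresh idea can be sketched as follows (the interval values and config shape here are hypothetical, not the project's actual settings):

```python
import datetime

# Hypothetical per-domain refresh intervals in days, with a
# global default used for any domain not listed explicitly.
REFRESH_INTERVAL_DAYS = {
    "example.com": 7,
    "__default__": 30,
}

def next_refresh(domain, last_fetched):
    # A page is due for re-fetch once its domain's interval has elapsed.
    days = REFRESH_INTERVAL_DAYS.get(domain, REFRESH_INTERVAL_DAYS["__default__"])
    return last_fetched + datetime.timedelta(days=days)

last = datetime.datetime(2024, 1, 1)
print(next_refresh("example.com", last))  # → 2024-01-08 00:00:00
print(next_refresh("other.net", last))    # → 2024-01-31 00:00:00
```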
There are also a lot of facilities responsible for feeding the releases/RSS views as part of wlnupdates.com.
Quick installation overview:
- Install Redis
- (optional) Install InfluxDB
- (optional) Install Graphite
- Install Postgresql >= 10.
- Build the community extensions for Postgresql.
- Create a database for the project.
- In the project database, install the `pg_trgm` and `citext` extensions from the community extensions modules.
- Copy `settings.example.py` to `settings.py`.
- Fill in all settings in `settings.py`.
- Set up the virtualenv by running `build-venv.sh`
- Activate the virtualenv: `source flask/bin/activate`
- Bootstrap the DB: `alembic upgrade head`
- (on another machine/session) Run the local fetch RPC server `run_local.sh` from https://github.com/fake-name/AutoTriever
- Run the server: `python3 run.py`
- If you want to run the spider, it has a LOT more complicated components:
  - The main scraper is started by `python runScrape.py`
  - The raw scraper is started by `python runScrape.py raw`
  - The scraper periodic scheduler is started by `python runScrape.py scheduler`
- The scraper requires substantial RPC infrastructure. You will need:
  - A RabbitMQ instance with a public DNS address
  - A machine running saltstack + salt-master with a public DNS address. On the salt machine, run https://github.com/fake-name/AutoTriever/tree/master/marshaller/salt_scheduler.py
  - A variable number of RPC workers to execute fetch tasks. The AutoTriever project can be used to manage these.
  - A machine to run the RPC local demultiplexing agent (`run_agent.sh`). The RPC agent allows multiple projects to use the RPC system simultaneously. Since the RPC system basically allows executing either predefined jobs or arbitrary code on the worker swarm, this is fairly useful in general, so I've implemented it as a service that multiple of my projects then use.
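The demultiplexing agent's role can be sketched like this (an assumed design for illustration only, not AutoTriever's actual protocol or message format): one shared response stream from the worker swarm is fanned out to per-project queues, keyed by a project tag on each message.

```python
import queue

class DemuxAgent:
    # Toy demultiplexer: each project registers its own queue, and
    # incoming messages are routed by the "project" tag they carry.
    def __init__(self):
        self.project_queues = {}

    def register(self, project):
        q = queue.Queue()
        self.project_queues[project] = q
        return q

    def dispatch(self, message):
        # Each response carries the tag of the project that issued it.
        self.project_queues[message["project"]].put(message)

agent = DemuxAgent()
proxy_q = agent.register("ReadableWebProxy")
wln_q = agent.register("wlnupdates")

agent.dispatch({"project": "ReadableWebProxy", "job": "fetch", "url": "http://example.com"})
print(proxy_q.get_nowait()["url"])  # → http://example.com
```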
Ubuntu dependencies
- postgresql-common libpq-dev libenchant-dev
- probably more I've forgotten