Automated training for Privacy Badger. Badger Sett automates browsers to visit websites to produce fresh Privacy Badger tracker data.
A sett or set is a badger's den which usually consists of a network of tunnels and numerous entrances. Setts incorporate larger chambers used for sleeping or rearing young.
This script is designed to raise young Privacy Badgers by teaching them about the trackers on popular sites. Every day, crawler.py visits thousands of the top sites from the Tranco List with the latest version of Privacy Badger, and saves its findings in results.json.
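To get a feel for what a crawl produced, you can peek at the top-level keys of results.json. This is just a quick inspection sketch; it assumes nothing about the data format beyond the file being a JSON object:

python3 -c "import json; print(list(json.load(open('results.json')).keys()))"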
See the following EFF.org blog post for more information: Giving Privacy Badger a Jump Start.
Install Python 3.8+
Create and activate a Python virtual environment:
python3 -m venv venv
source ./venv/bin/activate
pip install -U pip
For more, read this blog post.
Install Python dependencies with
pip install -r requirements.txt
Run static analysis with
prospector
Run unit tests with
pytest
Take a look at Badger Sett command-line flags with
./crawler.py --help
Git clone the Privacy Badger repository somewhere
Try running a tiny scan:
./crawler.py firefox 5 --no-xvfb --log-stdout --pb-dir /path/to/privacybadger
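The same quick test should also work with the other browsers this README uses elsewhere (chrome, edge), assuming the matching browser and WebDriver are installed; for example:

# hypothetical variant of the tiny scan above, using Chrome instead of Firefox
./crawler.py chrome 5 --no-xvfb --log-stdout --pb-dir /path/to/privacybadger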
Docker takes care of all dependencies, including setting up the latest browser version.
However, Docker brings its own complexity. Problems from improper file ownership and permissions are a particular pain point.
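If a previous scan left behind root-owned files in the output directory (docker-out, described below), one common fix, offered here only as a sketch, is to take ownership back:

# reclaim ownership of the Docker output directory for the current user
$ sudo chown -R "$USER" docker-out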
Prerequisites: have Docker installed. Make sure your user is part of the docker group so that you can build and run Docker images without sudo. You can add yourself to the group with

$ sudo usermod -aG docker $USER
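Note that the new group membership only applies to new sessions. As a rough check (hello-world is just a convenient test image, not part of Badger Sett), you can start a shell under the docker group and confirm Docker runs without sudo:

# start a shell with the docker group active, then run a trivial container
$ newgrp docker
$ docker run --rm hello-world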
Clone the repository
$ git clone https://github.com/efforg/badger-sett
Run a scan
$ BROWSER=firefox ./runscan.sh 500
This will scan the top 500 sites on the Tranco list in Firefox with the latest version of Privacy Badger's master branch.
To run the script with a different branch of Privacy Badger, set the PB_BRANCH variable. For example:

$ PB_BRANCH=my-feature-branch BROWSER=firefox ./runscan.sh 500
You can also pass arguments to crawler.py, the Python script that does the actual crawl. Any arguments passed to runscan.sh will be forwarded to crawler.py. For example, to exclude all websites ending with .gov and .mil from your website visit list:

$ BROWSER=edge ./runscan.sh 500 --exclude .gov,.mil
Monitor the scan
To have the scan print verbose output about which sites it's visiting, use the --log-stdout argument. If you don't use that argument, all output will still be logged to docker-out/log.txt, beginning after the script outputs "Running scan in Docker..."
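If the log is written as the scan progresses, you can watch a containerized scan in real time by following that file (a sketch, assuming the default docker-out location mentioned above):

# follow the scan log as new lines are appended
$ tail -f docker-out/log.txt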
To set up the script to run periodically and automatically update the repository with its results:
Create a new SSH key with ssh-keygen. Give it a name unique to the repository.

$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/USER/.ssh/id_rsa): /home/USER/.ssh/id_rsa_badger_sett
Add the new key as a deploy key with R/W access to the repo on GitHub: https://developer.github.com/v3/guides/managing-deploy-keys/
Add an SSH host alias for GitHub that uses the new key pair. Create or open ~/.ssh/config and add the following:

Host github-badger-sett
  HostName github.com
  User git
  IdentityFile /home/USER/.ssh/id_rsa_badger_sett
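As an optional sanity check on the alias and key, you can try connecting through the new host entry (GitHub normally replies with an authentication greeting rather than a shell):

# test SSH authentication via the host alias
$ ssh -T github-badger-sett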
Configure git to connect to the remote over SSH. Edit .git/config:

[remote "origin"]
  url = ssh://git@github-badger-sett:/efforg/badger-sett

This will have git connect to the remote using the new SSH keys by default.
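To make sure the remote URL and key are wired up correctly, a simple, non-destructive check is to list the remote's refs over the new connection:

# fetch nothing, just list refs through the configured remote
$ git ls-remote origin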
Create a cron job to call runscan.sh once a day. Set the environment variable RUN_BY_CRON=1 to turn off TTY forwarding to docker run (which would break the script in cron), and set GIT_PUSH=1 to have the script automatically commit and push results.json when the scan finishes. Here's an example crontab entry:

0 0 * * * RUN_BY_CRON=1 GIT_PUSH=1 BROWSER=chrome /home/USER/badger-sett/runscan.sh 6000 --exclude=.mil,.mil.??,.gov,.gov.??,.edu,.edu.??
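Before handing the job to cron, it can be worth simulating the invocation by hand, with a smaller site count and without GIT_PUSH=1 so nothing gets committed or pushed during the test:

# manual test run mimicking the cron environment (no push)
$ RUN_BY_CRON=1 BROWSER=chrome /home/USER/badger-sett/runscan.sh 500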
If everything has been set up correctly, the script should push a new version of results.json after each scan.
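If you want to confirm the automated pushes are landing (assuming the commits touch results.json as described above), one way is to look at that file's recent history:

# show the last few commits that changed results.json
$ git log --oneline -3 -- results.json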