- Notifications
You must be signed in to change notification settings - Fork0
Privacy-preserving Firefox extension linking to Hacker News discussion; built with Bloom filters and WebAssembly
License
jstrieb/hackernews-button
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Firefox extension that links to theHacker Newsdiscussion for the current page and preserves privacy with Bloom filters.
Install the browser extension from one of the following sources:
The extension will light up bright orange when the current page has previouslybeen posted to Hacker News.
- Clicking the extension will open the Hacker News discussion.
- Clicking the extension with the scroll wheel will open the discussion in anew tab.
- Clicking while holdingCtrl orShift will open thediscussion in a new tab or window, respectively.
There are also keyboard shortcuts.
- Alt +Y opens the Hacker News discussion in the currentpage
- Ctrl +Shift +Y opens the discussion in anew tab.
Star this project if you like it!
Read theHacker News Discussionfor this project.
When you visit a website, this browser extension determines whether the websitehas been submitted to Hacker News. A naive (but effective) way to do this is toquery the very helpfulAlgolia Search API for HackerNews with every page visited. In fact, that's whatthe original version of this extension did when I wrote it over the summer of2020! Unfortunately, there are two problems with this naive approach: youreveal every website you visit to Algolia, and you waste bandwidth and energysending and receiving extraneous API requests.
To solve this problem, this extension uses a data structure called aBloomfilter to protect your privacy.Bloom filters can be thought of as a super condensed representation of thefingerprints of a long list of URLs. In this way, you can download the Bloomfilter once (with periodic updates), and check if it contains the currentwebsite's URL fingerprint without making any requests over the Internet.
Click to read Bloom filter parameter details
Bloom filters are probabilistic data structures, which means that when youquery whether a string is in the set represented by the Bloom filter, theresponse from the data structure is either "no," or "probably yes." Bloomfilters have two parameters that can be tuned to minimize the likelihood offalse positive results: the size of the filter (the number of bits), and thenumber of hashes used to obtain a fingerprint of each item.
Based on calculations performed using thisBloom filtercalculator, the Bloomfilters used by this Firefox extension occupy 16MB of space and use 23 hashfunctions. Since (at the time of this release) there are approximately 4million submitted Hacker News stories, this gives a 1 in 10 million chance of afalse positive match on the Bloom filter. This probability gradually increasesto 1 in 26,000 as the number of submissions approaches 6 million, and becomes 1in 850 by the time there have been 8 million Hacker News story submissions. Atthat point, it will likely be worthwhile to consider increasing the size of theBloom filter.
16MB was chosen as the Bloom filter size, and the number of hashes was adjustedaround it. This size is convenient because it is not too large for an initialdownload of multiple Bloom filters. Additionally, 16MB Bloom filtersrepresenting smaller time windows (e.g. submissions from the last 24 hours) arevery sparse, and thus compress extremely well. For example, the Bloom filterrepresenting submissions from the last 24 hours compresses from 16MB to about50KB. Though the false positive rate could be further reduced andfuture-proofed, doubling the Bloom filter size to 32MB is a significantincrease, even with compression.
If the current page has been on Hacker News, the extension lights up andbecomes clickable. Clicking it retrieves a link to the best discussion for thepage and navigates the browser there.
By default, the extension uses several Bloom filters to show a lower-bound onthe score for each page. This can be easily disabled from the "Options" pagefor the extension, accessible by going toabout:addons
. It might be desirableto disable this if using multiple filters is too resource-intensive.
Click to read more about score thresholds
It seemed reasonable to use at most five distinct Bloom filters. Because theybecome increasingly sparse as the number of stories in the Bloom filterdecreases, they compress well, so adding additional Bloom filters doesn't havea massive impact on the total amount of data downloaded.
On the other hand, uncompressed, they total5 * 16MB = 80MB
in memory – morethan this seemed unreasonable.
The five thresholds for the Bloom filters were chosen mostly by eye, butvalidated and tuned using analysis of the dataset.
Range | Count |
---|---|
0-10 | 3381917 |
10-75 | 300300 |
75-250 | 121291 |
250-500 | 25739 |
500+ | 7948 |
As of February 28, 2021, the ranges have an approximately logarithmicallydecreasing number of entries. This is desirable because this mirrors the truedistribution of the data, which is also approximately logarithmic. It alsoallows for acceptably sensible, informative score ranges.
The data used for this analysis can be viewedhere.It was generated with the following BigQuery SQL query, and the thresholds weretuned in the spreadsheet.
SELECT score,COUNT(score)AS countFROM`bigquery-public-data.hacker_news.full`WHERE scoreIS NOT NULLAND score!=0GROUP BY scoreORDER BY score
You still send data to Algolia when you click the extension to visit thediscussion. The improvement offered by using Bloom filters is to not sendallof the sites you visit to the API, butsome data still need to be sent toretrieve the link to the discussion. Moreover, by default an updated Bloomfilter is downloaded once every 24 hours from GitHub. It is possible thatGitHub maintains logs of who downloads these releases.
Browser extensions have a lot of power to harm users, so it is important tounderstand what you are running. To that end, I provide a description of how toread this code. Please audit the code before running it.
This repository has three parts:
- Code to pull Hacker News data and generate Bloom filters from it
- Code for the browser extension
- A Bloom filter library used by the Bloom filter generator and the browserextension – just one implementation used by both parts of the project
Each of the three individual parts of the code are described in greater depthbelow. Click "Details" to read more.
TheMakefile
is used for almost all parts of the code, and is a good place to start readingto understand how everything fits together.
Details
Files to read:
The code for Bloom filters is implemented in C. This code is used in acommand-line C program to generate Bloom filters, which is compiled usinggcc
. It is also used by the browser extension in a wrapper library, which iscompiled to WebAssembly using Emscripten (emcc
in theMakefile
).
Thetest
folder includes tests for various parts of the Bloom filter library to ensureit is working as expected.
Files to read:
Bloom filters are regularly regenerated on a schedule, mediated by a GitHubActions workflow. At a high level, this process pulls down relevant data fromtheHacker News BigQuerydataset,does some preprocessing, normalizes ("canonicalizes") URLs, and feeds them tothe command-line Bloom filter generator. Generated Bloom filters are uploadedasGitHub Releases sousers running the extension can download the latest ones.
Since Bloom filters can only match exact strings, it is helpful to"canonicalize" URLs so that there are fewer false negative results. In otherwords, because multiple URLs often point to the same page,canonicalize.py
is useful for ensuring that slightly different URLs submitted to Hacker Newsfor the current page still match in the Bloom filter. Unfortunately, thisprocess is inherently imperfect. Opening issues with suggested improvements tothe URL canonicalization process are appreciated!
For actually reading strings, adding them to Bloom filters, and writing(compressed) Bloom filters, we compile and usebloom-create.c
.This takes some command-line arguments, and then reads from standard input,parses the line-delimited strings, and outputs a Bloom filter.
Files to read:
Themanifestconnects all parts of the extension together. It attaches keyboard commands toevents and runs a page with background scripts, which do most of the heavylifting. It also runs a small content script onnews.ycombinator.com
pages.
There are two important background scripts.background.js
is responsible for displaying the browser extension and handling userinteraction.bloom-wrap.js
makes the Bloom filter library (implemented in C) easily accessible fromJavaScript via low-level wrappers and high-level helper functions. It alsoincludes code that, when the browser starts and WebAssembly is ready, attemptsto either load a Bloom filter from local storage, or download the latest onefrom GitHub.
The content script that runs onnews.ycombinator.com
pages extracts "story"URLs from the pages and adds them to the Bloom filter. This is useful becausethe Bloom filters only update every 24 hours at most (as limited by thefrequency of BigQuery dataset updates), so adding stories to the Bloom filterthis way makes it possible to use the extension to view the discussion forrecently-submitted posts. This would otherwise not be possible until the Bloomfilter is updated many hours later.
Note that thebackground.html
page also loads a scriptbloom.js
that is notin the repo. As per theMakefile
,this script is compiled from the Bloom filter C library using Emscripten.
This project is actively developed and maintained. If there have not beencommits long after the initial release, everything is probably runningsmoothly!
The project is designed so that even if something were to happen to me, as longas my GitHub account is open, the Actions workflow should continue to releaseupdated Bloom filters.
I will do my best to address issues in a timely fashion, but I'm busy and thisis a side-project. Unsolicited pull requests are likely to be ignored. This isbecause releasing a browser extension means I have a (moral, notlegal– see theLICENSE)responsibility for the security of everyone who installs it. As a result,vetting random pull requests is typically not worth the effort unless theyaddress an issue that has been discussed beforehand. I'm happy to have others'support, just ask first – open an issue to do so.
- Fork your own copy of the repository
- Create a new project inBigQuery
- Create a service account with the
BigQuery User
permission - Generate a JSON key
- Enable Actions for the repository
- Copy the JSON key into an Actions secret called
BQ_JSON
(under Settings >Secrets > Actions) - Make your fork public if you want to be able to access it unauthenticated
- Change the repo to your liking, maintaining attribution and the LICENSE file!
- Change the repo to your liking, maintaining attribution and the LICENSEfile!
There is currently no version of this extension for Google Chrome. To readmore and discuss, check out the relevant issue(#1).
TheURLcanonicalizationis highly imperfect. There will inevitably be false negatives in Bloom filterresults. Suggestions for improving canonicalization in general, or forspecific sites, are welcome!
If the button is clicked, Algolia search tries to return the "best"submission for a given URL. Often this is not the latest submission, but theone with the most points.
This also means that if the button is clicked for very recently submittedstories (when browsingnew, forexample), Algolia may not have indexed the story yet, causing the redirect tofail.
On my computer, the plus signs in the badge text gets cut off for three-digitscores (#2).
There are a few things you can do to support the project:
- Star the repository (and follow me on GitHub for more)
- Share and upvote on sites like Twitter, Reddit, and Hacker News
- Report any bugs, glitches, or errors that you find
These things motivate me to to keep sharing what I build, and they providevalidation that my work is appreciated! They also help me improve the project.Thanks in advance!
If you are insistent on spending money to show your support, I encourage you toinstead make a generous donation to one of the following organizations. Byadvocating for Internet freedoms, organizations like these help me to feelcomfortable releasing work publicly on the Web.
This project is not affiliated with Hacker News, Y Combinator, or any YCombinator-backed company.
This project would not exist in its current form without:
- Daniel Gackle (dang)
- Logan Snow (@lsnow99)
- Amy Liu
- Hacker News
- Thomas Hurst'sBloom filter calculator
- zlib
- MurmurHash and Austin Appleby
- GitHub Actions
- BigQuery
- Algolia Hacker News Search
- Anyone who has asked or answered a helpful question on StackOverflow
- Mozilla Developer Networkdocumentation – mysine qua non for writing anything for the Web, includingbrowser extensions
About
Privacy-preserving Firefox extension linking to Hacker News discussion; built with Bloom filters and WebAssembly