UrlPatternIndex
overviewThe UrlPatternIndex component can be used to build an index over a set of URL rules, and speed up matching network requests against these rules.
A URL rule (seeflat::UrlRule
structure) describes a subset of network requests that it targets. The essential element of the rule is its URL pattern, which is a simplified regular expression (a string with wildcards).UrlPatternIndex
is mainly based on text fragments extracted from the patterns.
The component uses theFlatBuffers serialization library to represent the rules and the index. The key advantage of the format is that it does not require deserialization. Once built, the data structure can be stored on disk or transferred, then copied/loaded/memory-mapped and used directly.
UrlPattern
sThe component is built around an underlying concept of a URL pattern, defined in the classUrlPattern
. These patterns are largely inspired by patterns inEasyList / Adblock Plus filters and are documented in more detail in thedeclarativeNetRequest documentation.
The underlying goal of the index format is to efficiently check to see if URLs match any URL patterns contained in the index. The data structure used here is an N-gram filter. An N-gram is a string consisting of N (up to 8) bytes. Currently, the component has chosen to usekNGramSize = 5
.
The strategy used in this component is to build a data structure which mapsNGram -> vector<UrlRule>
, by finding all N-grams associated with a given URL pattern, and picking one of them (the most distinctive one, seeUrlPatternIndexBuilder::GetMostDistinctiveNGram
). The URL pattern is then inserted into the map associated with that N-gram.
Note: URL patterns have special characters like*
and^
which implement special wildcard matching. N-grams are built onlybetween these special characters.
For example, the URL patternfoo.com/*abc*
will generate the following 5-grams:
foo.coo.coo.com.com/
Seeurl_pattern_index.fbs for the raw underlying Flatbuffers format which builds the N-gram filter using acustom hash table implementation.
Querying a built index is very similar to building the index in the first place. Given a URL, it is broken into all of it's component N-grams, just like the URL pattern was above. For example, the URLhttps://foo.com/?q=abcdef
would generate the following 5-grams:
httpsttps:tps:/ps://s://f://fo//foo/foo.foo.coo.coo.com.com/com/?om/?qm/?q=/?q=a?q=abq=abc=abcdabcdebcdef
With these N-grams extracted, we can just consider all of theUrlPattern
s which are associated with those N-grams. SeeFindMatchInFlatUrlPatternIndex
andFindMatchAmongCandidates
for this logic.
Many of these N-grams match ones that are also present in thefoo.com/*abc*
example above , so we can be sure that that URL pattern will be considered during pattern evaluation.
You might be thinking “what about URLs whose length is less than N, or patterns that generate no N-grams?” We will make sure to put all rules like that into a special list called thefallback_rules
which are applied to every URL unconditionally.
UrlPattern
This logic is encapsulated inUrlPattern::MatchesUrl
. This essentially consists of splitting a URL pattern by the*
wildcard, and considering each subpattern in between the*
s.
There is some complexity here to deal with:
^
separator matching, which matches any ASCII symbol except letters, digits, and the following:'_', '-', '.', '%'
. Seefuzzy_pattern_matching.|
Left/right anchors, which specifies the beginning or end of a URL.||
Domain anchors, which specifies the start of a (sub-)domain of a URL.After all this complexity is dealt with, the bulk of the subpattern logic is simplyStringPiece::find / std::search
! This component used to use something much more complicated (Knuth-Morris-Pratt algorithm), but benchmarking on real URLs proved the simple solution was more optimal (and removed the need for a preprocessing step at indexing time), so it wasremoved.
For example, in checking ifhttps://foo.com/?q=abcdef
matchesfoo.com/*abc*
, the component will:
foo.com/
andabc
.foo.com/
inhttps://foo.com/?q=abcdef
, which is a match!abc
in?q=abcdef
, which is a match! This is the last pattern, so return true