This repository was archived by the owner on Nov 1, 2022. It is now read-only.

A test suite of common scraper detection techniques. See how detectable your scraper stack is.


unblocked-web/double-agent



NOTICE 📝 This module has been merged into the unblocked monorepo for future development!


Double Agent is a suite of tools that lets a scraper engine test whether it is detectable when trying to blend into the most common web traffic.

DoubleAgent has been organized into two main layers:

  • /collect: scripts/plugins for collecting browser profiles. Each plugin generates a series of pages to test how a browser behaves.
  • /analyze: scripts/plugins for analyzing browser profiles against verified profiles. Scraper results from collect are compared to legitimate "profiles" to find discrepancies. These checks produce a "Looks Human"™ score, which indicates the likelihood that a scraper will be flagged as a bot.
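
The comparison in the analyze layer can be pictured as matching each collected signal against the values that verified browser profiles are known to emit. Below is a minimal sketch, assuming hypothetical `Signal` and `VerifiedProfile` shapes; it is an illustration of the idea, not the actual DoubleAgent API:

```typescript
// Hypothetical shapes for illustration only (not DoubleAgent's real types).
type Signal = { probeId: string; value: string };

// A "verified" profile maps a probe id to the set of values real browsers emit.
type VerifiedProfile = Map<string, Set<string>>;

// Score a scraper run: the fraction of collected signals whose value matches
// something a verified browser has been observed to emit. Mismatches
// (discrepancies) drag the score down.
function looksHumanScore(collected: Signal[], verified: VerifiedProfile): number {
  let matches = 0;
  for (const signal of collected) {
    const expected = verified.get(signal.probeId);
    if (expected?.has(signal.value)) matches += 1;
  }
  return collected.length === 0 ? 1 : matches / collected.length;
}
```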

The easiest way to use collect is with the collect-controller:

  • /collect-controller: a server that can generate step-by-step assignments for a scraper to run all tests

Plugins

The bulk of the collect and analyze logic has been organized into what we call plugins.

Collect Plugins

| Name | Description |
| --- | --- |
| browser-codecs | Collects the audio, video, and WebRTC codecs of the browser |
| browser-dom-environment | Collects the browser's DOM environment, such as object structure, class inheritance, and key order |
| browser-fingerprints | Collects various browser attributes that can be used to fingerprint a given session |
| browser-fonts | Collects the fonts of the current browser/OS |
| browser-speech | Collects browser speech synthesis voices |
| http-assets | Collects the headers used when loading assets such as CSS, JS, and images in a browser |
| http-basic-headers | Collects the headers sent by a browser when requesting documents in various contexts |
| http-ua-hints | Collects User-Agent hints for a browser |
| http-websockets | Collects the headers used when initializing and facilitating WebSockets |
| http-xhr | Collects the headers used by browsers when facilitating XHR requests |
| http2-session | Collects the settings, pings, and frames sent across by a browser's HTTP/2 client |
| tcp | Collects TCP packet values such as window size and time to live |
| tls-clienthello | Collects the TLS ClientHello handshake when initiating a secure connection |
| http-basic-cookies | Collects a wide range of cookie configuration options and whether they're settable/gettable |

Analyze Plugins

| Name | Description |
| --- | --- |
| browser-codecs | Analyzes whether the audio, video, and WebRTC codecs match the given user agent |
| browser-dom-environment | Analyzes whether the DOM environment, such as functionality and object structure, matches the given user agent |
| browser-fingerprints | Analyzes whether the browser's fingerprints leak across sessions |
| http-assets | Analyzes HTTP header order, capitalization, and default values for common document assets (images, fonts, media, scripts, stylesheets, etc.) |
| http-basic-cookies | Analyzes whether cookies are enabled correctly, including same-site and secure |
| http-basic-headers | Analyzes header order, capitalization, and default values |
| http-websockets | Analyzes WebSocket upgrade request header order, capitalization, and default values |
| http-xhr | Analyzes header order, capitalization, and default values of XHR requests |
| http2-session | Analyzes HTTP/2 session settings and frames |
| tcp | Analyzes TCP packet values, including window size and time to live |
| tls-clienthello | Analyzes ClientHello handshake signatures, including ciphers, extensions, and version |

Probes:

DoubleAgent operates on the notion of "probes". Probes are checks, or "tests", that reliably measure a piece of information emitted by a browser. The collect phase of DoubleAgent gathers raw data from browsers running a series of tests. The analyze phase turns that raw data into "probes" using these patterns.

Each measured "signal" from a browser is stored as a probe-id, which is the raw output of the actual values emitted.

Probes are created during "Profile Generation", which creates all the possible probe-ids, along with the browsers and operating systems they correspond to. These are grouped into "Probe Buckets": a tool to find overlap between the millions of signals browsers put out and to reduce the noise when presenting the information.

```json
{
  "id": "aord-accv",
  "checkName": "ArrayOrderIndexCheck",
  "checkType": "Individual",
  "checkMeta": {
    "path": "headers:none:Document:host",
    "protocol": "http",
    "httpMethod": "GET"
  },
  "args": [
    [
      [],
      [
        "connection",
        "upgrade-insecure-requests",
        "user-agent",
        "accept",
        "accept-encoding",
        "accept-language",
        "cookie"
      ]
    ]
  ]
}
```

Probe ids for that pattern look like: `http:GET:headers:none:Document:host:ArrayOrderIndexCheck:;connection,upgrade-insecure-requests,user-agent,accept,accept-encoding,accept-language,cookie`. This probe id captures a bit about the test, as well as the measured signal from the browser.
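
That probe id can be reconstructed mechanically from the check's metadata and the measured values. A small sketch follows; the field layout is inferred from the sample above, so treat the exact string format as an assumption rather than the project's canonical encoding:

```typescript
// Shape of the checkMeta object from the probe pattern above.
interface CheckMeta {
  path: string;
  protocol: string;
  httpMethod: string;
}

// Join the protocol, HTTP method, path, and check name with the measured
// header order to form a probe id (format inferred from the example).
function toProbeId(checkName: string, meta: CheckMeta, headerOrder: string[]): string {
  return `${meta.protocol}:${meta.httpMethod}:${meta.path}:${checkName}:;${headerOrder.join(',')}`;
}

const probeId = toProbeId(
  'ArrayOrderIndexCheck',
  { path: 'headers:none:Document:host', protocol: 'http', httpMethod: 'GET' },
  ['connection', 'upgrade-insecure-requests', 'user-agent', 'accept',
   'accept-encoding', 'accept-language', 'cookie'],
);
```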

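The "Probe Buckets" mentioned above can be pictured as grouping probe ids by the exact set of browsers that emit them, so that millions of individual signals collapse into a much smaller number of buckets. This is a hypothetical sketch of that grouping idea, not the real implementation:

```typescript
// Group probe ids by the sorted set of browsers that emit them.
// Probe ids emitted by identical browser sets land in one bucket.
function bucketProbes(probeToBrowsers: Record<string, string[]>): Map<string, string[]> {
  const buckets = new Map<string, string[]>();
  for (const [probeId, browsers] of Object.entries(probeToBrowsers)) {
    const key = [...browsers].sort().join(';');
    const bucket = buckets.get(key) ?? [];
    bucket.push(probeId);
    buckets.set(key, bucket);
  }
  return buckets;
}
```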
Updating the Probe "Sources"

Probes are generated from a baseline of browsers. Double Agent comes with some built-in profiles in probe-data based on the browsers here. Double Agent is built to allow testing a single browser, or generating a massive data set to see how well scrapers can emulate many browsers. As this is very time-consuming, we tend to limit the tested browsers to the last couple of versions of Chrome, which is what Unblocked Agent can currently emulate.

If you wish to generate probes for different browsers (or a wider set), you can follow these steps to update the data:

  1. Clone the unblocked-web/unblocked monorepo and install git submodules.
  2. Download the unblocked-web/browser-profiler data by running yarn downloadData in that workspace folder.
  3. Modify double-agent/stacks/data/external/userAgentConfig.json to include the browser ids you wish to test (`<browser.toLowercase()>-<major>-<minor ?? 0>`).
  4. Run yarn 0 to copy in the profile data.
  5. Run yarn 1 to create new probes.

Testing your Scraper:

To view examples of running the test suite with a custom browser, check out the DoubleAgent Stacks project in Unblocked.
