Creating and viewing WARC web archives in Servo

Alan Jeffrey edited this page Feb 23, 2018 · 4 revisions

Why?

WARC web archives are a de facto standard for archiving web content. They are the storage format for the Internet Archive Wayback Machine and are supported by the Library of Congress.

Servo can make use of web archives for repeatable performance testing against real-world web content, in particular sites which have a large amount of third-party ad content. Web archives can be replayed using an http proxy which does not have live access to the internet, which provides a stable platform for performance benchmarking.

This note describes how to record and replay web archives, using Servo as the web engine.

This is used by the Servo WARC tests to produce a dashboard of Servo performance on archived web content.

Installing

The pywb tools support recording and playback of web archives. They can be run using Python 2 or 3, and can be installed using pip (inside a virtualenv if preferred):

pip install git+https://github.com/ikreymer/pywb.git

Docs are at https://pywb.readthedocs.io.

Creating a web archive

To create a web archive, first initialize pywb using wb-manager:

wb-manager init archives

Then start the http server, giving it access to the live internet, and asking it to record and index an archive:

wayback --live --record --autoindex

You can now browse the web, and pages will be recorded in your archives. For example (from the servo build directory):

./mach run -r http://localhost:8080/archives/record/https://nytimes.com/

Once enough of the page is visible, you can quit servo and the wayback server. This should have created an archive file:

ls collections/archives/archive/rec-TIMESTAMP-MACHINE.warc.gz
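
As a quick sanity check that the recording produced usable files, something like the following sketch could be run from the directory where wb-manager was initialized. It assumes the archives live somewhere under collections/archives and that find and gzip are available. The function name check_warcs is purely illustrative and is not part of pywb.

```shell
# Hypothetical helper (not part of pywb): check every .warc.gz under a directory.
check_warcs() {
    dir=${1:-collections/archives}
    find "$dir" -name '*.warc.gz' | sort | while IFS= read -r warc; do
        # gzip -t verifies the compressed stream without writing any output.
        if ! gzip -t "$warc" 2>/dev/null; then
            echo "corrupt: $warc"
            continue
        fi
        # Every WARC record begins with a version line such as "WARC/1.0".
        case "$(gzip -dc "$warc" | head -c 5)" in
            WARC/) echo "ok: $warc" ;;
            *)     echo "not a WARC: $warc" ;;
        esac
    done
}
```

Calling check_warcs with no argument scans collections/archives. A file reported as corrupt usually means the wayback server was stopped while it was still writing.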

Replaying a web archive

To replay a web archive already in your collection, first start the wayback server:

wayback

Then view the archived content:

./mach run -r http://localhost:8080/archives/https://nytimes.com/

To replay a web archive recorded elsewhere, first add it to your collection:

wb-manager add archives some-other-archive.warc.gz
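
The record and replay URLs used above differ only in the record/ path segment, so when scripting several pages it may help to build them with small helpers. This is a minimal sketch: the function and variable names are illustrative, and the defaults assume the wayback server on localhost:8080 and a collection named archives, as in the commands above.

```shell
# Assumed defaults: wayback on localhost:8080 and a collection named "archives".
WAYBACK_BASE=${WAYBACK_BASE:-http://localhost:8080}
COLLECTION=${COLLECTION:-archives}

# URL that fetches the page from the live web and records it.
record_url() { printf '%s/%s/record/%s\n' "$WAYBACK_BASE" "$COLLECTION" "$1"; }

# URL that replays the page from the collection.
replay_url() { printf '%s/%s/%s\n' "$WAYBACK_BASE" "$COLLECTION" "$1"; }
```

For example, ./mach run -r "$(replay_url https://nytimes.com/)" would be equivalent to the replay command shown earlier.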

Replaying a web archive as an http proxy

Replaying archives this way is simple, but has some problems:

It relies on URL rewriting, to add the http://localhost:8080/archives/ prefix to any loaded resources, which does not catch everything, in particular URLs dynamically constructed using JavaScript.
Any URLs that are not rewritten will be fetched from the live internet, so not all content is the same between runs.
Since URLs are rewritten to be under localhost, they are all considered same-origin by Servo, so will all be executed in the same content thread, losing a lot of the benefit of concurrency.

Fortunately, pywb also supports replaying via an HTTP proxy, which removes the need for URL rewriting, since all content is delivered via the proxy.

wayback --proxy archives

Now, as well as serving URL-rewritten content on localhost, the wayback server is also acting as an HTTP proxy, serving the original content without any URL rewriting. Unfortunately there are some steps to get Servo to view this content:

Servo does not have support for HTTP proxies, so needs to be proxified. On Linux this can be done with the proxychains command (installed in Debian-based systems by apt-get install proxychains).
Any https content is served using a certificate signed with a key stored in proxy-certs/pywb-ca.pem. This certificate must be added as a root certificate for Servo.

To run Servo with proxychains, first create a proxychains.conf file:

[ProxyList]
http 127.0.0.1 8080
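
If the configuration needs to be generated from a script, a here-document is one way to do it. The sketch below writes the file into the current directory, where proxychains normally looks first. The function name and its optional host and port arguments are illustrative, with defaults matching the proxy address used above.

```shell
# Hypothetical helper: write a proxychains.conf pointing at the wayback proxy.
write_proxychains_conf() {
    host=${1:-127.0.0.1}
    port=${2:-8080}
    # The [ProxyList] header and the proxy entry must be on separate lines.
    cat > proxychains.conf <<EOF
[ProxyList]
http $host $port
EOF
}
```

Running write_proxychains_conf with no arguments produces the two-line file described above.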

then run:

proxychains ./mach run -r --certificate-path proxy-certs/pywb-ca.pem https://nytimes.com/

This will view the archived content without any URL rewriting.

Unfortunately, there is a gotcha. Plain http content can be proxied in two ways, using CONNECT or using GET: proxychains uses CONNECT, but for http content pywb only supports GET. Fortunately, for https content there is just CONNECT, so this technique works for https content (which these days is most of the web).
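
To see the difference between the two proxying styles outside Servo, curl can serve as a stand-in client. This is only a sketch: curl is not mentioned in the original note, and the helper names and the DRY_RUN switch are illustrative. The proxy address and certificate path are the ones used above.

```shell
PROXY=${PROXY:-http://127.0.0.1:8080}
CA_CERT=${CA_CERT:-proxy-certs/pywb-ca.pem}

# Print the command instead of running it when DRY_RUN is set.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi; }

# Plain http through the proxy: curl sends "GET http://host/path" to the proxy.
fetch_http_get() { run curl -sS -o /dev/null -x "$PROXY" "$1"; }

# Plain http, forcing a CONNECT tunnel with -p (--proxytunnel), as proxychains does.
fetch_http_connect() { run curl -sS -o /dev/null -p -x "$PROXY" "$1"; }

# https through an http proxy always uses CONNECT; trust the pywb CA so the
# certificate presented by the proxy verifies.
fetch_https() { run curl -sS -o /dev/null -x "$PROXY" --cacert "$CA_CERT" "$1"; }
```

Going by the description above, fetch_http_get and fetch_https would be expected to succeed against a running wayback proxy, while fetch_http_connect would be expected to fail in the same way that proxychains does for http content.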

At some point, Servo may get native support for HTTP proxies, at which point this should become a non-issue, but for now we're stuck only being able to test https content.

Recording a web archive as an http proxy

To record an archive while running as an http proxy:

wayback --proxy archives --live --proxy-record --autoindex

then run Servo with an http proxy as before.
