Web wide crawl with initial seedlist and crawler configuration from March 2011. This uses the new HQ software for distributed crawling by Kenji Nagahashi.
What’s in the data set:
Crawl start date: 09 March, 2011
Crawl end date: 23 December, 2011
Number of captures: 2,713,676,341
Number of unique URLs: 2,273,840,159
Number of hosts: 29,032,069
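As a rough illustration of how the figures above relate to one another, the short sketch below derives a few ratios from them. The variable names are our own, and reading "captures minus unique URLs" as captures beyond the first for each URL is an assumption, not something stated in the crawl report.

```python
# Figures copied from the data set summary above.
captures = 2_713_676_341
unique_urls = 2_273_840_159
hosts = 29_032_069

# Average number of captures per distinct URL.
captures_per_url = captures / unique_urls

# Average number of distinct URLs per host.
urls_per_host = unique_urls / hosts

# Captures beyond the first for each URL (assumed interpretation).
extra_captures = captures - unique_urls

print(f"captures per URL: {captures_per_url:.3f}")
print(f"URLs per host:    {urls_per_host:.2f}")
print(f"extra captures:   {extra_captures:,}")
```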
The seed list for this crawl was a list of Alexa’s top 1 million web sites, retrieved close to the crawl start date. We used Heritrix (3.1.1-SNAPSHOT) crawler software and respected robots.txt directives. The scope of the crawl was not limited except for a few manually excluded sites.
However, this was a somewhat experimental crawl for us: we were using newly minted software to feed URLs to the crawlers, and we know there were some operational issues with it. For example, in many cases we may not have crawled all of the embedded and linked objects in a page, because the URLs for these resources were added to queues that quickly grew bigger than the intended size of the crawl (and therefore we never got to them). We also included repeated crawls of some Argentinian government sites, so results broken down by country will be somewhat skewed.
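The queue problem described above can be pictured with a toy model. The following is a minimal sketch only, not the HQ or Heritrix implementation: the `crawl` function, the `links` map, the example URLs, and the `budget` parameter are all hypothetical. It shows how a breadth-first queue that grows faster than the fetch budget leaves embedded resources uncrawled.

```python
from collections import deque

def crawl(seeds, links, budget):
    """Toy breadth-first crawl that stops after `budget` fetches.

    `links` maps a URL to the URLs discovered on it (hypothetical data).
    Returns the URLs fetched and the URLs still queued at the end.
    """
    frontier = deque(seeds)
    seen = set(seeds)
    fetched = []
    while frontier and len(fetched) < budget:
        url = frontier.popleft()
        fetched.append(url)
        # Newly discovered URLs go to the back of the queue.
        for out in links.get(url, []):
            if out not in seen:
                seen.add(out)
                frontier.append(out)
    return fetched, list(frontier)

# One page with two embedded objects and one outgoing link.
links = {
    "a.example/": ["a.example/logo.png", "a.example/style.css", "b.example/"],
    "b.example/": ["b.example/img.jpg"],
}
fetched, left = crawl(["a.example/"], links, budget=2)
print(fetched)  # pages and objects actually captured
print(left)     # queued URLs the crawl never reached
```

With a budget of two fetches, the page and one of its embedded objects are captured, while the stylesheet and the linked site remain in the queue.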
We have made many changes to how we do these wide crawls since this particular example, but we wanted to make the data available “warts and all” for people to experiment with. We have also done some further analysis of the content.
If you would like access to this set of crawl data, please contact us at info at archive dot org and let us know who you are and what you’re hoping to do with it. We may not be able to say “yes” to all requests, since we’re just figuring out whether this is a good idea, but everyone will be considered.