Web wide crawl with initial seedlist and crawler configuration from March 2011. This uses the new HQ software for distributed crawling by Kenji Nagahashi.
What's in the data set:
Crawl start date: 09 March, 2011
Crawl end date: 23 December, 2011
Number of captures: 2,713,676,341
Number of unique URLs: 2,273,840,159
Number of hosts: 29,032,069
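As a rough aid for reading the figures above, the following sketch derives a few averages from them. The function name and dictionary keys are illustrative assumptions, and the results are simple arithmetic on the published totals, not additional statistics from the crawl.

```python
from datetime import date

# Figures copied from the data set summary above.
CRAWL_START = date(2011, 3, 9)
CRAWL_END = date(2011, 12, 23)
CAPTURES = 2_713_676_341
UNIQUE_URLS = 2_273_840_159
HOSTS = 29_032_069


def derived_stats():
    """Return simple averages computed from the published totals.

    The key names are hypothetical, chosen only for this illustration.
    """
    duration_days = (CRAWL_END - CRAWL_START).days
    return {
        "duration_days": duration_days,
        # A value above 1 means some URLs were captured more than once.
        "captures_per_unique_url": CAPTURES / UNIQUE_URLS,
        "unique_urls_per_host": UNIQUE_URLS / HOSTS,
        "captures_per_day": CAPTURES / duration_days,
    }


if __name__ == "__main__":
    for name, value in derived_stats().items():
        print(f"{name}: {value:,.2f}")
```

These are averages over the whole crawl, so they say nothing about how captures were distributed across hosts or over time.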
The seed list for this crawl was a list of Alexa's top 1 million web sites, retrieved close to the crawl start date. We used Heritrix (3.1.1-SNAPSHOT) crawler software and respected robots.txt directives. The scope of the crawl was not limited except for a few manually excluded sites.
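For readers unfamiliar with what respecting robots.txt means in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not how Heritrix implements the check; the robots.txt content, the user agent name, and the `is_allowed` helper are all assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any site in the crawl.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/private/page.html"):
        print(url, is_allowed(EXAMPLE_ROBOTS_TXT, "example-crawler", url))
```

A crawler that respects these directives would skip any URL for which the check returns False.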
However, this was a somewhat experimental crawl for us, as we were using newly minted software to feed URLs to the crawlers, and we know there were some operational issues with it. For example, in many cases we may not have crawled all of the embedded and linked objects in a page, since the URLs for these resources were added into queues that quickly grew bigger than the intended size of the crawl (and therefore we never got to them). We also included repeated crawls of some Argentinian government sites, so looking at results by country will be somewhat skewed.
We have made many changes to how we do these wide crawls since this particular example, but we wanted to make the data available "warts and all" for people to experiment with. We have also done some further analysis of the content.
If you would like access to this set of crawl data, please contact us at info at archive dot org and let us know who you are and what you're hoping to do with it. We may not be able to say yes to all requests, since we're just figuring out whether this is a good idea, but everyone will be considered.
was created over lunch at the home of Tanayo, once billed as in December of 1994. Handing me an old address book, with scratched out faded names, many of which were stage names, she asked if I could help find old friends from her days on the burlesque stage. Three weeks later I accepted her challenge and the Burlesque Historical Society began to take shape.
during her career, did everything in her power to keep the dancers connected over the years. Jennie passed away in 1990, long before this group was created, but her interest in burlesque, its history and its people, has been a great help and inspiration to me over the years.
was slow at first, and it still comes and goes in spurts. I suspect it may always be hard for me to comprehend how many thousands of people worked the various stages of burlesque, whether it was in theaters, clubs or carnivals. It took so many people to put on a show.
(date negotiable)... please send me their information. Everybody is looking for somebody!
is all about is quite simple. We re-connect old friends who worked in burlesque with one another, and share information. In the past, several Reunions have been held in both California and Las Vegas. Will there be more Reunions? I can't say for sure. What seems to be most important to the people in the group is that they receive the newsletters that are put out four times a year and that they can reconnect and stay in touch with old friends. What is most important to the Burlesque Historical Society is that the history of old time Burlesque is preserved.