The Wayback Machine - https://web.archive.org/web/20110520163913/http://news.az/mobile

NEWS.AZ

Latest Articles

Real reactions to Azerbaijan's Eurovision win
News.Az reprints an article by Nurani from Azerbaijani newspaper Echo.

Azerbaijani-Turkish fraternity eternal - Erdogan
Fraternity between Turkey and Azerbaijan is eternal, Turkish Premier Recep Tayyip Erdogan said while speaking in Kars during his election campaign.

Azerbaijani President meets Jerzy Buzek
Azerbaijani President Ilham Aliyev met European Parliament President Jerzy Buzek on Friday.

President inspects development work in Baku villages
President Ilham Aliyev has paid another visit to the villages of Mardakan and Bilgah on the Absheron Peninsula, not far from Baku.

Over 120 criminal gangs nabbed in Baku this year
A total of 122 criminal gangs were charged in the Azerbaijani capital, Baku, in the first four months of 2011.

Neftchi footballer called to Azerbaijan's national squad
Azerbaijan's national squad has started a training session in Baku as part of preparations for the Euro 2012 qualifiers against Kazakhstan and Germany.

More reassurances on Armenian nuke plant
The Armenian nuclear power station is operating normally, Energy Minister Armen Movsisyan has said.

Azerbaijan replenishes gold reserves
The Central Bank of Azerbaijan has received another consignment of gold extracted from the Gadabay deposit.

SOCAR, OMV seal memo of mutual understanding
The State Oil Company of Azerbaijan (SOCAR) and the Austrian company OMV signed a memorandum of mutual understanding on bilateral cooperation on 18 May.

Baku to introduce long-range ship tracking service
Azerbaijan is to introduce a National Service for the Long-Range Identification and Tracking of Vessels.
