DOI® News April 2011

DOI® News is a public news release. Information contained within this newsletter may be reproduced and disseminated to all interested parties.

In this issue:

DOI System exceeds 50 million assigned identifiers
DOI System and Linked Data
Revised DOI charging model

1. DOI System exceeds 50 million assigned identifiers

As at April 2011 the DOI System has now assigned over 51 Million DOI Names. The DOI System currently is used by over 4,000 naming authorities (assigners). Around 100 million DOI resolutions are made each month.

2. DOI System and Linked Data

Digital Object Identifiers assigned by CrossRef (www.crossref.org) are now enabled for use in linked data applications. The term "linked data" describes a set of best practices for exposing data in machine-readable form using the standard HTTP web protocol. These best practices support the development of tools to link and make use of data from multiple web sources without the need to deal with many different proprietary and incompatible application programming interfaces (APIs).

A significant advantage of applying Linked Data principles and technologies to DOI-registered material is that it is 'data worth linking to': it is curated, value-added, data, which is managed, corrected, updated and consistently maintained by Registration Agencies. It is also persistent, so avoiding 'bit-rot'.

The DOI web proxy (http://dx.doi.org) is now enabled to support content negotiation for DOI names. In the early days of the web, human beings were following most URLs, and it made sense that the DOI web proxy only resolved CrossRef DOI names to human-readable web pages.
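As a rough illustration of what content negotiation means here, the sketch below builds an HTTP request to the DOI web proxy that asks for machine-readable metadata instead of a landing page. It is not taken from the announcement: the function name `build_doi_request`, the DOI name `10.1234/example`, and the choice of `application/rdf+xml` as the requested media type are all assumptions made for the example, and which formats are actually served depends on the Registration Agency.

```python
from urllib.parse import quote
from urllib.request import Request

# Proxy address as given in the newsletter.
DOI_PROXY = "http://dx.doi.org/"


def build_doi_request(doi_name: str, media_type: str = "application/rdf+xml") -> Request:
    """Build (but do not send) a request for a DOI name with an Accept header.

    The Accept header is how a client states which representation it wants;
    the default media type here is an assumption, not a documented guarantee.
    """
    url = DOI_PROXY + quote(doi_name, safe="/")
    return Request(url, headers={"Accept": media_type})


# Hypothetical DOI name, used only to show the shape of the request.
request = build_doi_request("10.1234/example")
print(request.full_url)
print(request.get_header("Accept"))
```

Passing the resulting object to `urllib.request.urlopen` would send the request. A client that omits the `Accept` header would be expected to get the traditional redirect to a human-readable page.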
This announcement by CrossRef is part of improvements the International DOI Foundation is continuing to make to facilitate more sophisticated uses of a DOI name beyond single redirection to a human-readable landing page, including Linked Data (machine-readable metadata in the form of RDF); delivery of information in other formats (XML, etc.); and multiple typing (multiple URLs, other non-URL types to express semantic relationships, etc.) using mapping technologies of the Vocabulary Mapping Framework (VMF). We will be making further announcements on some of these later this year.

For further information see:

CrossRef Linked Data Announcement and Examples
Vocabulary Mapping Framework (VMF)

3. Revised DOI cost-sharing model

The International DOI Foundation has announced a change in the way in which DOI Registration Agencies fund the common infrastructure of the DOI System.

The DOI System is a cost-recovery system. The cost of common DOI infrastructure (run by the International DOI Foundation on behalf of all DOI Registration Agencies) is met by a charge made to each Registration Agency, whilst allowing each Registration Agency to adopt individual commercial models incorporating DOI registration for their services.

As of June 2011, the DOI System will adopt a revised model for this charge, transitioning from the current financial model (a charge per DOI name registered) to a revised model (based on a fixed fee per Registration Agency). The introduction of this new system will result in lower long-term charges, and has been made possible through the successful growth of the DOI System; it is designed to encourage further growth. Registration Agencies remain free to adopt their own charging model for their individual value-added services.

For further information, contact contact@doi.org.

The DOI is a system for interoperably identifying and exchanging intellectual property in the digital environment.
A DOI assigned to content enhances a content producer's ability to trade electronically. It provides a framework for managing content in any form at any level of granularity, for linking customers with content suppliers, for facilitating electronic commerce, and enabling automated copyright management for all types of media.

The International DOI Foundation, a non-profit organization, manages development, policy and licensing of the DOI to registration agencies and technology providers and advises on usage and development of related services and technologies. The DOI system uses open standards with a standard syntax (ANSI/NISO Z39.84) and is currently used by leading international technology and content organizations.

This is a service announcement for the International DOI Foundation and has been prepared to inform you of developments to enable digital copyright management of intellectual property. For more information, please send your request to contact@doi.org.

Prepared 20 April 2011