Copyright © 1998 W3C (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply. Your interactions with this site are in accordance with our public and Member privacy statements.
This document is a W3C Note reporting on the results of the HTTP-NG Web Characterization Group and the structure of the Web Characterization Activity. The work, which was part of the W3C HTTP-NG Activity, phase I, is now continued in the Web Characterization Activity.
Review comments on this document should be sent to <www-wca@w3.org>, which is the archived email list for the Web Characterization Activity. Information on how to subscribe to public W3C email lists can be found at the subscription request page.
This document is a NOTE made available by the W3C for discussion only. This indicates no endorsement of its content, nor that the Consortium has, is, or will be allocating any resources to the issues addressed by this NOTE.
This document describes the experiences and results that came out of the Web Characterization Group as part of the W3C HTTP-NG Activity, and how that work is now continued in the Web Characterization Activity.
The HTTP-NG Working Group created a series of scenarios for the HTTP-NG protocol design group, which were implemented in the scope of the HTTP-NG testbed, and used to optimize its design.
The WCA started in November 1998, and will bring that work model to a wider audience.
Web Characterization is concerned with looking at the overall patterns of Web structure and usage by measuring such aspects as server access patterns, the kind of data being accessed, bytes transferred, popularity of resources, etc. By better understanding the dynamics of the Web and how it grows we believe that W3C and the Web Community in general will be better suited to evolve the Web and to ensure its long term interoperability and robustness.
The purpose of the Activity is to define and implement a scalable mechanism for gathering data, boiling it down, and presenting it in efficient ways to content providers, service providers, user groups, researchers, technology designers, and other groups.
The information used to characterize the Web is strictly concerned with general patterns of Web usage and does not focus on specific users or Web sites. The scope of this Activity is to characterize the Web as a distributed system and not on an individual basis.
The HTTP-NG Web Characterization Group was chartered in August 1997 as a part of the HTTP-NG Activity. Its intent was to create a stable and comprehensive platform of knowledge and analysis of the Web, to enable the protocol designers to create a relevant and well-instructed solution. Previously, analysis of user behavior on the Web has often been based on spurious data, gathered in an ad-hoc manner. The HTTP-NG Web Characterization Group was an attempt at rectifying this.
It was set up to fulfill four primary goals:
The group consisted of members from Boston University's Ocean group, Harvard College's Vino group, INRIA, Microsoft, Netscape, Virginia Tech's Network Resource Group, and Xerox PARC's Webology group. Jim Pitkow, Xerox PARC, chaired the group.
The HTTP-NG WCG has leveraged and helped focus existing research programs, which the group considers one of its major accomplishments.
During its charter, the group has responded to the questions of the HTTP-NG Protocol Design Group. This has been influential in the design of the HTTP-NG protocol. It has also created the HTTP-NG testbed, which operates by using SURGE (Scalable URL Generator) from Boston University Ocean Group. Scenario parameters derived from observed statistical regularities in the distribution of file sizes, reading times, and other metrics were used to simulate client traffic in the testbed. SURGE used some aspects of Web traffic which were not taken into account by then-current traffic generators.
Status | Date accomplished | Deliverable |
---|---|---|
Done | Oct. 2-3, 1997 | First face-to-face meeting |
Done | Nov. 1, 1997 | Identification of classification parameters for Web categorization |
Done | Dec. 8, 1997 | Plan for response to HTTP-NG Protocol Design Group questions |
Done | Dec. 31, 1997 | Initial response to HTTP-NG PDG questions |
Done | Feb. 7, 1998 | Final response to HTTP-NG PDG questions |
Done | March-April 1998 | Trace analysis for scenario building, refined testbed software |
Done | April 24, 1998 | Extended scenarios, refined testbed software |
Moved to WCA | | Definition of new log file format |
Moved to WCA | | Recommendations for automatic re-sampling |
Done | June 24, 1998 | Project evaluation |
The group has completed all the original requirements, with the exception of the redesign of the Common Log File Format and the recommendations for automatic re-sampling of the Web, which have been moved to the Web Characterization Activity.
The W3C Web Characterization Activity was started in November 1998 with a workshop, gathering some 50 persons interested in the subject. Subsequently, a working group and an interest group have been started.
The purpose of the Activity is to define and implement a scalable mechanism for gathering data, boiling it down, and presenting it in efficient ways to content providers, service providers, user groups, researchers, technology designers, and other groups.
The information used to characterize the Web is strictly concerned with general patterns of Web usage and does not focus on specific users or Web sites. The scope of this Activity is to characterize the Web as a distributed system and not on an individual basis.
The Web Characterization Group in the HTTP-NG Activity was a first phase in this project. It was completed in August 1998, and phase 2 has begun. The focus of phase 2 is to extend the Web Characterization work and to create an active knowledge base containing up-to-date information about the Web by broadening the scope of Web characterization and by providing information and test scenarios for the W3C Membership and the Web community in general about the Web and its use, both now and in the near future.
An important result of WCG is the identification of the three key groups in the characterization work and how they interact:
The Bulk Data Providers are typically server maintainers and ISPs providing server and proxy logs, but can also be backbone providers gathering information directly from the Net or users running instrumented Web clients. Because of privacy concerns and the sheer size of log files, it is often preferred that data providers run a set of characterization tools locally so that only the boiled-down data sets and profiles are released.
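As an illustration of what such a locally run reduction step might look like, the following is a minimal sketch only; it is not one of the WCG tools, which this Note does not describe. It assumes the input is in Common Log Format, and the function name `reduce_log`, the profile fields, and the sample lines are all hypothetical. The point it shows is that client hosts and requested URLs are read but never copied into the released profile.

```python
import re
from collections import Counter

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+|-)$'
)


def reduce_log(lines):
    """Boil raw log lines down to an aggregate profile (hypothetical format).

    Only counts and totals are kept; hosts, dates and URLs are discarded.
    """
    profile = {"requests": 0, "bytes": 0, "status": Counter(), "skipped": 0}
    for line in lines:
        match = CLF_PATTERN.match(line.strip())
        if match is None:
            # Lines that do not parse are counted, not guessed at.
            profile["skipped"] += 1
            continue
        _host, _date, _request, status, size = match.groups()
        profile["requests"] += 1
        profile["status"][status] += 1
        if size != "-":  # "-" means no body was sent
            profile["bytes"] += int(size)
    return profile


# Made-up sample input, for illustration only.
sample = [
    '10.0.0.1 - - [05/Nov/1998:10:00:00 -0500] "GET /index.html HTTP/1.0" 200 2048',
    '10.0.0.2 - - [05/Nov/1998:10:00:01 -0500] "GET /logo.gif HTTP/1.0" 200 512',
    '10.0.0.1 - - [05/Nov/1998:10:00:02 -0500] "GET /missing.html HTTP/1.0" 404 -',
    'not a log line',
]
profile = reduce_log(sample)
```

Only `profile` would leave the data provider's site in this sketch; the raw lines stay local.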
The WCG develops and maintains a set of characterization tools used by the data providers and defines the mechanism for exchanging boiled down data sets and profiles with the data providers in order to maintain confidentiality and trust. The collected data sets are used to develop characterization models and to provide characterization data to the third group, the reduced data consumers.
The reduced data consumers use the profiles and data sets provided by the WCG and provide feedback and new questions to be asked. Primary data consumers are expected to be content providers, service providers, user groups, researchers and technology designers.
The format for this Activity is to let the interaction between the reduced data consumers and bulk data providers take place through an Interest Group, with a new Web Characterization Working Group (WCG) functioning as the mediator, provider of analysis tools and disseminator of characterization information.
The role of the Interest Group is to be a discussion forum for bulk data providers and reduced data consumers, and to provide requests and feedback to the Working Group. It is expected that the tools and dissemination mechanism produced by the Working Group will benefit from a feedback mechanism with its immediate users, as well as their continuous review. All work will be discussed on the Web Characterization Activity Forum.
Participation in the Interest Group is open to everybody.
The Activity was kicked off by the Web Characterization Workshop, November 5, 1998 in Boston, MA, with the intent of bringing together both W3C Members and Web characterization experts. As a result of the Workshop, the Interest Group was formed, and several organizations that wanted to participate in the Working Group were identified.
The WCG is intended to work using a request/response based model similar to the one used in the HTTP-NG Activity. Requests will be formally issued by the Interest Group and by W3C Activities, and the WCG will respond with realistic time lines for when and how results can be made available.
The WCG will start its work by formally soliciting requests for characterization data needed by other W3C Working Groups and Activities. The solicitation process is intended to occur at six-month intervals, enough time for the Working Group to understand and respond to the requests of the other W3C Groups. Requests from the Interest Group will be dealt with on a case-by-case basis. All work will be discussed on the Web Characterization Activity Forum.
The working group has the following participants:
Name | Affiliation | Function in the WCA |
---|---|---|
Marc Abrams | Virginia Tech | |
Martin F. Arlitt | HP Labs | |
Paul Barford | Boston University | |
Pei Cao | University of Wisconsin | |
Anja Feldmann | AT&T Research Labs | |
Edward A. Fox | Virginia Tech | |
Johan Hjelm | Ericsson/W3C | Interest Group Chair |
Balachander Krishnamurthy | AT&T Research Labs | |
Jim Gettys | W3C/Compaq | |
Joe Meadows | Boeing | |
Henrik Frystyk Nielsen | W3C | W3C Staff Contact |
Ed O'Neill | OCLC | |
Jim Pitkow | Xerox PARC | Working Group Chair |
Further information about the work in progress can be found at the Web Characterization Activity Home Page.
The following are examples of some of the findings of the HTTP-NG WCG and other researchers in the field of Web Characterization. This is by no means meant to be either a complete listing of the findings of the HTTP-NG WCG or a representative sample of research in the field. Rather, it contains results that the group found provocative and representative of the types of questions the HTTP-NG WCG found to be of interest.
Alexa Internet and WCG Analysis of AOL Data - December 1997
Source: W3C, Mark Gray, Netcraft Server Survey
The HTTP-NG testbed was designed for the specific purpose of making reliable and convincing claims that the performance of HTTP-NG would be comparable to prior HTTP implementations. It was designed in close cooperation with the HTTP-NG Protocol Design Group.
An analysis of the current practice in load generation tools left the HTTP-NG WCG concerned with the representativeness of the traffic being generated.
Essentially, three types of traffic generation models exist: stress testing, trace replay, and statistically derived models. Many current traffic generators follow the first model, by varying the number of requests per second that are issued to the server. While this approach does test the capacity of the server as measured by the number of HTTP operations per second, it does not produce traffic patterns that have actually been observed.
The second model for traffic generation utilizes packet traces collected from various servers and protocol analyzers. If this method had been used in the testbed, the group would have had to acquire traces from representative servers. Apart from determining what is representative, this presents the problems of deciding which servers to include and obtaining permission to use their log file information. Each Web site would also need to be recreated because of, for example, the effect of the file system configuration on performance.
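The difference between the first two models can be made concrete by looking at when requests are issued. The sketch below is illustrative only and does not come from the testbed; the function names `stress_schedule` and `replay_schedule` and the sample timestamps are hypothetical. A stress test issues requests at a fixed rate chosen by the tester, whereas trace replay reuses the spacing recorded at a real server.

```python
def stress_schedule(requests_per_second, duration_s):
    """Stress testing: evenly spaced send times at a rate the tester picks."""
    interval = 1.0 / requests_per_second
    count = int(requests_per_second * duration_s)
    return [i * interval for i in range(count)]


def replay_schedule(trace_timestamps):
    """Trace replay: send times follow the recorded trace, shifted to start at 0."""
    start = min(trace_timestamps)
    return sorted(t - start for t in trace_timestamps)


# Four requests per second for one second.
stress_times = stress_schedule(4, 1)

# Made-up trace timestamps (seconds), for illustration only.
replay_times = replay_schedule([105.0, 100.0, 100.5])
```

The stress schedule is perfectly regular, which is exactly why it does not resemble observed traffic; the replay schedule keeps the observed gaps but is tied to the one site the trace came from.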
Consequently, the group chose to model HTTP traffic statistically. The users were segmented into three strata: corporate users, ISP users, and educational users. To create models for the behavior of each stratum, the group obtained full log files from America Online (major ISP), AltaVista (search engine/mixed user group), and Boston University (educational users). From Microsoft (corporate usage), a distribution of usage was obtained. All data sets except the AltaVista data were used to generate scenarios for the testbed. The log file analysis tools used were based on the prior work of the group members, and the personal connections of the group members were instrumental in obtaining these data sets.
The HTTP-NG testbed is designed as the diagram below shows:
The HTTP-NG testbed was thus able to take both network characteristics and user behavior into account, inserting a simulated network between the robot simulating the client and the server. The statistical traffic generator takes a set of parameters to create a mock server with the associated file system, and a set of simulated clients that make statistically based requests for files.
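A statistical traffic generator of this general kind can be sketched in a few lines. The following is not SURGE and does not use the testbed's parameters; the distributions chosen here (Pareto file sizes, exponential reading times, rank-based popularity) and every parameter value are assumptions made purely for illustration, as are the function names.

```python
import random


def build_file_sizes(n_files, rng, alpha=1.2, min_bytes=1024):
    """Create the mock server's file sizes.

    Assumption: sizes follow a heavy-tailed Pareto distribution.
    alpha and min_bytes are illustrative values, not measured ones.
    """
    return [int(min_bytes * rng.paretovariate(alpha)) for _ in range(n_files)]


def simulate_client(file_sizes, n_requests, rng, mean_think_s=10.0):
    """Yield (time_offset_s, file_index, size_bytes) for one simulated client.

    Assumptions: reading time between requests is exponential with mean
    mean_think_s, and the file at rank r is requested with weight 1/r.
    """
    clock = 0.0
    weights = [1.0 / rank for rank in range(1, len(file_sizes) + 1)]
    for _ in range(n_requests):
        clock += rng.expovariate(1.0 / mean_think_s)
        index = rng.choices(range(len(file_sizes)), weights=weights, k=1)[0]
        yield clock, index, file_sizes[index]


rng = random.Random(42)  # fixed seed so a scenario can be repeated
sizes = build_file_sizes(50, rng)
requests = list(simulate_client(sizes, 20, rng))
```

Changing the parameters, rather than the code, is what would distinguish one scenario (for example one user stratum) from another in a design like this.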
The model characterizes sites as containing Web pages with embedded media and Web pages without embedded media. Using a model that characterizes pages, rather than just objects, makes alteration in the composition of sites easier. This facilitates determining the effect of new technologies, like Cascading Style Sheets (CSS).
Throughout the year of the WCG's existence, various group members have contributed papers, articles, and presentations to the group and the Web characterization community. Given the limited focus of the HTTP-NG project effort, it is not surprising that these items are focused on characterizations and representative testbed designs.
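One way to picture a page-level model is as a base document plus a list of embedded objects. The sketch below is a hypothetical data structure, not the one used in the testbed; the class names, the helper `add_shared_stylesheet`, and the sizes are invented for illustration. It shows why a page-level model makes site composition easy to alter: a change such as giving every page one extra embedded object is a single transformation over the pages.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WebObject:
    name: str
    size_bytes: int


@dataclass
class Page:
    base: WebObject                       # the HTML document itself
    embedded: List[WebObject] = field(default_factory=list)

    def requests_needed(self) -> int:
        """One request for the base document plus one per embedded object."""
        return 1 + len(self.embedded)

    def total_bytes(self) -> int:
        return self.base.size_bytes + sum(o.size_bytes for o in self.embedded)


def add_shared_stylesheet(pages, stylesheet):
    """Return a new site in which every page also embeds the stylesheet."""
    return [Page(p.base, p.embedded + [stylesheet]) for p in pages]


# Made-up site: one page with embedded media, one without.
site = [
    Page(WebObject("index.html", 4000), [WebObject("logo.gif", 1500)]),
    Page(WebObject("about.html", 2500)),
]
styled_site = add_shared_stylesheet(site, WebObject("site.css", 800))
```

With an object-only model, the same experiment would require knowing which objects belong together, which is the information the page grouping carries.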
Author(s) | Papers, Articles, Notes | Date Published |
---|---|---|
Jim Pitkow | W3C Note: HTTP-NG WCG Status Report | July 1998 |
Jim Pitkow | Summary of WWW Characterizations Paper at WWW7 | April 1998 |
Huberman, Pirolli, Pitkow and Lukose | Strong Regularities in World Wide Web Surfing (PDF format) | April 1998 |
Barford and Crovella | Generating Representative Web Workloads for Network and Server Performance Evaluation (Postscript format) | November 1997 |
Manley, Courage and Seltzer | A Self-Scaling and Self-Configuring Benchmark for Web Servers | November 1997 |
Manley and Seltzer | Web Fact and Fantasy | October 1997 |
Abdulla, Fox and Abrams | Shared User Behavior on the World Wide Web | October 1997 |
The group has achieved its objectives, creating feedback for the HTTP-NG Protocol Design Group by answering the questions this group had about the Web, and by creating the HTTP-NG testbed, which enabled the creation of an optimized and efficient design of the next generation of the Hypertext Transfer Protocol. The Web characterization work is now being continued in the Web Characterization Activity.