
[Mirrored from: http://sunsite.unc.edu/pub/sun-info/standards/xml/why/xmlapps.html, December 12, 1996]

XML, Java, and the future of the Web

Jon Bosak, Sun Microsystems
Last revised 1996.11.25

Introduction

The extraordinary growth of the World Wide Web has been fueled by the ability it gives authors to easily and cheaply distribute electronic documents to an international audience. As Web documents have become larger and more complex, however, Web content providers have begun to experience the limitations of a medium that does not provide the extensibility, structure, and data checking needed for large-scale commercial publishing. The ability of Java applets to embed powerful data manipulation capabilities in Web clients makes even clearer the limitations of current methods for the transmittal of document data.

To address the requirements of commercial Web publishing and enable the further expansion of Web technology into new domains of distributed document processing, the World Wide Web Consortium has developed an Extensible Markup Language (XML) for applications that require functionality beyond the current Hypertext Markup Language (HTML). This paper describes the XML effort and discusses new kinds of Java-based Web applications made possible by XML.

Background: HTML and SGML

Most documents on the Web are stored and transmitted in HTML. HTML is a simple language well suited for hypertext, multimedia, and the display of small and reasonably simple documents. HTML is based on SGML (Standard Generalized Markup Language, ISO 8879), a standard system for defining and using document formats.

SGML allows documents to describe their own grammar -- that is, to specify the tag set used in the document and the structural relationships that those tags represent. HTML applications are applications that hardwire a small set of tags in conformance with a single SGML specification. Freezing a small set of tags allows users to leave the language specification out of the document and makes it much easier to build applications, but this ease comes at the cost of severely limiting HTML in several important respects, chief among which are extensibility, structure, and validation.

Extensibility. HTML does not allow users to specify their own tags or attributes in order to parameterize or otherwise semantically qualify their data.
Structure. HTML does not support the specification of deep structures needed to represent database schemas or object-oriented hierarchies.
Validation. HTML does not support the kind of language specification that allows consuming applications to check data for structural validity on importation.

In contrast to HTML stands generic SGML. A generic SGML application is one that supports SGML language specifications of arbitrary complexity and makes possible the qualities of extensibility, structure, and validation missing from HTML. SGML makes it possible to define your own formats for your own documents, to handle large and complex documents, and to manage large information repositories. However, full SGML contains many optional features that are not needed for Web applications and has proven to have a cost/benefit ratio unattractive to current vendors of Web browsers.

The XML effort

The World Wide Web Consortium (W3C) has created an SGML Editorial Review Board and an SGML Working Group to build a set of specifications to make it easy and straightforward to use the beneficial features of SGML on the Web. See the W3C SGML Activity page [1] for the current status of this effort. The goal of the W3C SGML activity is to enable the delivery of self-describing data structures of arbitrary depth and complexity to applications that require such structures.

The first phase of this effort is the specification of a simplified subset of SGML specially designed for Web applications. This subset, called XML (Extensible Markup Language), retains the key SGML advantages of extensibility, structure, and validation in a language that is designed to be vastly easier to learn, use, and implement than full SGML.

XML differs from HTML in three major respects:

Information providers can define new tag and attribute names at will.
Document structures can be nested to any level of complexity.
Any XML document can contain an optional description of its grammar for use by applications that need to perform structural validation.
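
As a rough illustration of the first two points, the following Python sketch parses a small document whose tag and attribute names (datasheet, part, package, pin) are invented here and appear nowhere in any standard. It assumes the standard-library xml.etree.ElementTree parser, which checks well-formedness only; the optional grammar-based validation of the third point is not exercised.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: every tag and attribute name is invented for illustration.
DOC = """
<datasheet vendor="ExampleCorp">
  <part id="X100">
    <package type="DIP">
      <pin number="1">VCC</pin>
      <pin number="2">GND</pin>
    </package>
  </part>
</datasheet>
""".strip()


def max_depth(element):
    """Return the nesting depth of the tree rooted at element (root counts as 1)."""
    return 1 + max((max_depth(child) for child in element), default=0)


root = ET.fromstring(DOC)

# The parser needs no prior knowledge of the tag set.
tag_names = sorted({el.tag for el in root.iter()})
depth = max_depth(root)
pins = {pin.get("number"): pin.text for pin in root.iter("pin")}

print(tag_names, depth, pins)
```

The same few lines would work unchanged on a document with a different vocabulary or deeper nesting, which is the practical meaning of extensibility and structure in this context.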

XML has been designed for maximum expressive power, maximum teachability, and maximum ease of implementation. The language is not backward-compatible with existing HTML documents, but documents conforming to the W3C HTML 3.2 specification can easily be converted to XML, as can generic SGML documents and documents generated from databases.

An initial working draft for XML 1.0 [2] has been released for public discussion. A complete specification that includes methods for associating hypertext linking and stylesheet mechanisms with XML documents is scheduled for release at the Sixth World Wide Web Conference in April, 1997.

Web applications of XML

The applications that will drive the acceptance of XML are those that cannot be accomplished within the limitations of HTML. These applications can be divided into four broad categories:

Applications that require the Web client to mediate between two or more heterogeneous databases.
Applications that attempt to distribute a significant proportion of the processing load from the Web server to the Web client.
Applications that require the Web client to present different views of the same data to different users.
Applications in which intelligent Web agents attempt to tailor information discovery to the needs of individual users.

The alternative to XML for these applications is proprietary code embedded as "script elements" in HTML documents and delivered in conjunction with proprietary browser plug-ins or Java applets. XML derives from a philosophy that data belongs to its creators and that content providers are best served by a data format that does not bind them to particular script languages, authoring tools, and delivery engines but provides a standardized, vendor-independent, level playing field upon which different authoring and delivery tools may freely compete.

Database interchange: the universal hub

A paradigmatic example of this first category of XML applications is the information tracking system for a home health care agency.

Home health care is a major component of America's multibillion-dollar medical industry that continues to increase in importance as the health care burden is shifted from hospitals to home care settings. Information management is critical to this industry in order to meet the record-keeping requirements of the federal agencies and health maintenance organizations that pay for patient care.

The typical patient entering a home health care agency is represented to the information system by a large collection of paper-based historical materials in the form of patient medical histories and billing data from a variety of doctors, hospitals, pharmacies, and insurance companies. The biggest task in getting the patient into the system is the manual entry of this material into the agency's database.

The coming of the Web has given the medical informatics community the hope that an electronic means can be found to alleviate this burden. Unfortunately, existing Web applications represent fundamentally insufficient models for an adequate solution. Hospitals have begun to offer the agencies a solution that goes something like this:

Log into the hospital's Web site.
Become an authorized user.
Access the patient's medical records using a Web browser.
Print out the records from the browser.
Manually key in the data from the printouts.

The knowledgeable reader may smile at this "solution," but in fact this is not a joke; this is an actual proposal from a large American hospital known for its early adoption of advanced medical information systems.

A slightly more sophisticated version of this "solution" envisions the operator reading the patient data from the Web browser and keying it directly into the agency's online forms-based interface in a separate window instead of making a printout first. The only difference between this version and the previous one is that it saves the paper that would have been needed for the printout. It does nothing to address the root of the problem. A real solution would look more like this:

Log into the hospital's Web site.
Become an authorized user.
Access the patient's medical records in a Web-based interface that represents the records for that patient with a folder icon.
Drag the folder from the Web application over to the internal database application.
Drop it into the database.

However, this solution is not possible within the limitations of HTML, for three reasons.

The HTML tag set is too limited to represent or differentiate between the multitude of database fields in the mixture of documents making up the patient's medical history.
HTML is incapable of representing the variety of structures in those documents.
HTML lacks any mechanism for checking the data for structural validity before the receiving application attempts to import it into the target database.

One technically feasible way to implement seamless interchange of patient care records is simply to require all hospitals and health care agencies to use a single standard system dictated by the government (such an approach has actually been suggested). In an environment where hospitals are going out of business on a daily basis and many health care agencies are in deep financial difficulty, however, a scheme that would require them to replace their existing heterogeneous systems with a single new system en masse is hardly practical.

The other way to enable interchange between heterogeneous systems is to adopt a single industry-wide interchange format that serves as the single output format for all exporting systems and the single input format for all importing systems. This is, in fact, the purpose for which SGML was initially designed, and XML simply carries on this tradition.

A number of industries, including the aerospace, automotive, telecommunications, and computer software industries, have been using hub languages to perform data interchange for years, and by this time the process is well understood. Typically, the major players in an industry form a standards consortium tasked with defining a Document Type Definition (DTD), which is the way in which the tag set and grammar of a markup language are defined. This DTD can then be sent with documents that have been marked up in the industry standard language using off-the-shelf editing tools, and any standard application on the receiving end can validate and process them.
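
The role that such a grammar plays on the receiving end can be sketched roughly as follows. This is not DTD validation: Python's standard library does not include a DTD validator, so a hand-written table of allowed child elements stands in for the grammar, and every element name in it is hypothetical. The point is only that the importing application can reject a structurally invalid record before it reaches the database.

```python
import xml.etree.ElementTree as ET

# Hypothetical grammar standing in for an industry DTD (assumption, not a real DTD):
# each element name maps to the set of child elements it may contain.
GRAMMAR = {
    "patient": {"name", "allergies", "medications"},
    "name": set(),
    "allergies": {"substance"},
    "medications": {"substance"},
    "substance": set(),
}


def structural_errors(element, grammar, path=""):
    """Collect a message for every element the grammar does not allow."""
    here = f"{path}/{element.tag}"
    if element.tag not in grammar:
        return [f"{here}: unknown element"]
    errors = []
    for child in element:
        if child.tag not in grammar[element.tag]:
            errors.append(f"{here}/{child.tag}: not allowed inside <{element.tag}>")
        else:
            errors.extend(structural_errors(child, grammar, here))
    return errors


good = ET.fromstring(
    "<patient><name>A. Example</name>"
    "<allergies><substance>penicillin</substance></allergies></patient>"
)
bad = ET.fromstring(
    "<patient><name>A. Example</name><billing>42</billing></patient>"
)

good_errors = structural_errors(good, GRAMMAR)
bad_errors = structural_errors(bad, GRAMMAR)

# Import only records that pass the structural check.
for label, errors in (("good", good_errors), ("bad", bad_errors)):
    print(label, "accepted" if not errors else f"rejected: {errors}")
```

A real DTD also constrains element order, repetition, and attributes, none of which this simplified table attempts to express.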

The XML solution is system-independent, vendor-independent, and proven by over a decade of SGML implementation experience. XML merely extends this proven approach to document interchange over the Web. Interestingly, the same day on which the first XML 1.0 draft was released also saw the formal announcement of an initiative spearheaded by the University of Southern California Medical Center, Scripps Institute, and the Rand Corporation to develop a Health Care Markup Language designed to solve exactly the kind of problem described in this example.

Previous vertical-industry efforts have shown that capturing data in a rich markup often has benefits beyond the immediate requirements of data exchange. In a well-designed standardized patient data system, for example, specific information originally gathered in the course of a routine physical exam and tagged <allergies>, <drug-reactions>, and so on would instantly be available to alert the staff of an emergency room that an unconscious patient from a distant city was allergic to penicillin. The ability of XML to define tags specific to an area of application is critical to this scenario, because the otherwise unqualified word "penicillin" in the thousands of pages of a patient's entire medical history could not trigger the recognition that the same word inside an <allergies> element could trigger.
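
The difference between the two kinds of search can be made concrete with a small Python sketch. The <allergies> element comes from the example above; the surrounding record structure (patient, history, note, substance) is assumed purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical patient record; only <allergies> is taken from the text,
# the other element names are invented for illustration.
RECORD = """
<patient>
  <history>
    <note>Discussed penicillin alternatives with pharmacist.</note>
    <note>Sibling was prescribed penicillin in 1989.</note>
  </history>
  <allergies>
    <substance>penicillin</substance>
  </allergies>
</patient>
""".strip()

root = ET.fromstring(RECORD)

# Unqualified search: every mention of the word, whatever it means in context.
mentions = [el.tag for el in root.iter() if el.text and "penicillin" in el.text]

# Qualified search: only substances recorded inside <allergies>.
allergy_list = [s.text for s in root.findall("./allergies/substance")]
is_allergic = "penicillin" in allergy_list

print(mentions, allergy_list, is_allergic)
```

The unqualified search returns three hits, two of which say nothing about the patient's own allergies; only the element-scoped query supports an automatic alert.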

The health care example is relevant not only because of the scope of the problem and the enormous sums of money involved but also because it is paradigmatic of a very wide range of future Web applications -- any in which Web clients (or Java applications running on those clients) are expected to mediate the lossless exchange of complex data between systems that use different forms of data representation in a way that can be standardized across an industry or other interest group. Some random examples of such applications are:

Legal publishing
The government drug approval process
Collaborative CAD/CAM efforts
Collaborative calendar management across different systems
Any corporate network application that works across databases, especially where policies must be enforced: purchase orders, expense requests, etc.
Exchange of information between players in any broker-organized business: insurance, securities, banking, etc.

Distributed processing: giving Java something to do

A paradigmatic example of this second category of XML applications is the data delivery system designed by the semiconductor industry.

Each major semiconductor manufacturer maintains several terabytes of technical data on all of the ICs that it produces. To enable interchange of this data, an industry consortium (the Pinnacles Group) was formed several years ago by Intel, National Semiconductor, Philips, Texas Instruments, and Hitachi to design an industry-specific SGML markup language. The consortium finished that specification in 1995, and its member companies are now well into the implementation phase of the process.

One might think that the rise in popularity of HTML would cause the Pinnacles members to reconsider their decision, but in fact the limitations of HTML have convinced them that their original strategy was the correct one. Their initial idea was that the richly parameterized data stream made possible by the industry-specific SGML markup would enable intelligent applications not merely to display semiconductor data sheets as readable documents but actually to drive design processes. It is now recognized that this approach is a perfect fit with the concept of distributed Java applets, and the vision of the near future is one in which engineers can access a manufacturer's Web site and download not only viewable data on particular integrated circuits but also a Java applet that allows them to model those circuits in various combinations.

The semiconductor application is a good demonstration of the advantages of XML because:

It requires industry-specific markup that cannot be implemented within the confines of the fixed HTML tag set.
It requires that the data representation be platform- and vendor-independent so that data from a variety of sources can be used to drive a variety of distributed applications (some of which may be provided by third parties, generating a subindustry of providers of tools that can work with the standardized data stream).
Its utility rests ultimately in the fact that a computation-intensive process (modeling circuits for hours at a time) that would otherwise entail an enormous, extended resource hit on the server has been changed into a brief interaction with the server followed by an extended interaction with the user's own Web client. This aspect has been summed up in the slogan "XML gives Java something to do."

Note that validation, while sometimes important, does not always play the crucial role in this category of applications that it does in applications where data must be checked for structural integrity before entering a database. To make processing as efficient as possible, XML has been designed so that validation is optional in applications where it is not needed.

As with the health-care example, the semiconductor application is notable not merely for the sheer size of the market it represents but also because it is paradigmatic of an enormous range of future Java-based Web applications -- virtually any application in which standardized data is expected to be manipulated in interesting ways on the client. Perhaps the most obvious examples of such applications are the following:

Design applications where the designer would otherwise use server cycles to consider various alternatives: electronics, engineering, architecture, menu planning, etc.
Scheduling applications where a customer would otherwise use server cycles to entertain various possibilities: airlines, trains, buses, and subways; restaurants, movies, plays, and concerts. This is what Eaasy Sabre and Ticketron will look like a few years from now as the economies of distributed Java-based processing become evident.
Commercial applications that allow consumers to explore alternatives by supplying different shopping criteria: real estate, automobiles, appliances, etc.
The entire spectrum of educational applications, a small subset of which are the ones we call "online help".
The entire spectrum of customer-support applications, ranging from lawn-mower maintenance through technical support for computers.

A harbinger of applications to come in the last category is the Solution Exchange Standard, an SGML markup language announced last June by a consortium of over 60 hardware, software, and communications companies to facilitate the exchange of technical support information among vendors, system integrators, and corporate help desks. In the words of the announcement:

The standard has been designed to be flexible. It is independent of any platform, vendor or application, so it can be used to exchange solution information without regard to the system it is coming from or going to. [...] Additionally, the standard has been designed to have a long lifetime. SGML offers room for growth and extensibility, so the standard can easily accommodate rapidly changing support environments.

Such applications, which the XML subset is specifically designed to address, will grow in importance as consumers come to expect interoperability among their data-manipulating applets and information providers confront the realities of trying to support computation-intensive tasks directly on their Web servers.

View selection: letting the user decide

A third variety of XML applications comprises those in which users may wish to switch between different views of the data without requiring that the data be downloaded again in a different form from the Web server.

One early application in this category will be dynamic tables of contents. It is possible now, using Web servers built on object-oriented databases, to present the user with a table of contents into a large collection of data that can be expanded with a mouse click to "open up" a portion of the TOC and reveal more detailed levels of the document structure. Dynamic TOCs of this kind can be generated at run time directly from the hierarchical structure of the document. Unfortunately, the Web latency built into every expansion or contraction of the TOC makes this process sluggish in many user environments. A much better solution is to download the entire structured TOC to the client rather than just individual server-generated views of the TOC. Then the user can expand, contract, and move about in the TOC supported by a much faster process running directly on the client.
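
A minimal sketch of the client-side approach, written in Python rather than as an applet, might look like the following. The TOC format (nested section elements with id and title attributes) is an assumption made for this example. The whole structure is parsed once, and each expansion or contraction is then a purely local re-rendering.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured TOC; <section> and its attributes are invented names.
TOC = """
<toc>
  <section id="1" title="Installation">
    <section id="1.1" title="Requirements"/>
    <section id="1.2" title="Setup"/>
  </section>
  <section id="2" title="Reference">
    <section id="2.1" title="Commands"/>
  </section>
</toc>
""".strip()


def render(node, expanded, depth=0):
    """Return the visible TOC lines, descending only into expanded sections."""
    lines = []
    for section in node.findall("section"):
        lines.append("  " * depth + section.get("title"))
        if section.get("id") in expanded:
            lines.extend(render(section, expanded, depth + 1))
    return lines


toc = ET.fromstring(TOC)  # stands in for the single download

collapsed = render(toc, expanded=set())
opened = render(toc, expanded={"1"})  # a "click" on section 1, no server round trip

print("\n".join(opened))
```

Only the set of expanded section ids changes between views, so the cost of each click is independent of network latency.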

A group at Sun actually implemented a form of this solution as part of a Java-based HTML help browser, but the limitations of HTML required the team to come up with a couple of clever workarounds. In this application, a TOC was constructed by hand (the lack of structure in ordinary HTML makes it impossible to reliably generate a TOC directly from the document) using nonstandard tags invented for the purpose, and then the TOC piece was wrapped in a comment within an HTML page to hide the nonstandard markup from Web browsers. A Java applet downloaded with the HTML document interpreted the hidden markup and provided the client-based TOC behavior.

In practice, this application worked very well and testified both to the ingenuity of its designers and to the validity of the basic concept. But in an XML environment, neither the manual creation of the TOC nor its concealment would have been necessary. Instead, standard XML editors would have been used to create structured content from which a structured TOC could be generated at run time and downloaded to browsers that would automatically create and display the TOC using either a downloaded Java applet or a standard set of JavaHelp class libraries.

The ability to capture and transmit semantic and structural data made possible by XML greatly expands the range of possibilities for client-side manipulation of the way data appears to the user. For example:

A technical manual that covers both the SPARC and x86 versions of the Solaris operating system can be made to appear like a manual for SPARC only, or a manual for x86 only, just by clicking a preferences switch.
An installation sheet that carries warnings in multiple languages can be made to show just the ones in the language selected by the user.
A document containing many annotations can be switched from a mode that shows only the text, to a mode that shows only the annotations, to a mode that shows both, just by making a menu selection.
A phone book sorted by last name can instantly be changed into a phone book sorted by first name.
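
Two of these examples can be sketched in a few lines of Python. The data formats below are hypothetical (all element and attribute names are invented here); what matters is that both views are derived from one copy of the data already held by the client.

```python
import xml.etree.ElementTree as ET

# Hypothetical data; all element and attribute names are invented for illustration.
PHONE_BOOK = """
<phonebook>
  <entry><first>Carol</first><last>Adams</last><phone>555-0101</phone></entry>
  <entry><first>Alice</first><last>Young</last><phone>555-0102</phone></entry>
  <entry><first>Bob</first><last>Miller</last><phone>555-0103</phone></entry>
</phonebook>
""".strip()

SHEET = """
<sheet>
  <warning lang="en">Disconnect power before opening.</warning>
  <warning lang="de">Vor dem Oeffnen Netzstecker ziehen.</warning>
</sheet>
""".strip()


def sorted_view(book, field):
    """Return display names ordered by the given child element ('first' or 'last')."""
    entries = sorted(book.findall("entry"), key=lambda e: e.findtext(field))
    return [f"{e.findtext('first')} {e.findtext('last')}" for e in entries]


def warnings_for(sheet, lang):
    """Return only the warnings written in the selected language."""
    return [w.text for w in sheet.findall("warning") if w.get("lang") == lang]


book = ET.fromstring(PHONE_BOOK)  # delivered to the client once
by_last = sorted_view(book, "last")
by_first = sorted_view(book, "first")
english_only = warnings_for(ET.fromstring(SHEET), "en")

print(by_last, by_first, english_only)
```

Neither view requires the server to know in advance which presentation the user will choose.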

This list only hints at the possible uses that creative Web designers will find for richly structured data delivered in a standardized way to Web clients.

Web agents: data that knows about me

A future domain for XML applications will arise when intelligent Web agents begin to make larger demands for structured data than can easily be conveyed by HTML. Perhaps the earliest applications in this category will be those in which user preferences must be represented in a standard way to mass media providers. The key requirements for such applications have been summed up by Matthew Fuchs of Disney Imagineering: "Information needs to know about itself, and information needs to know about me."

Consider a personalized TV guide for the fabled 500-channel cable TV system. A personalized TV guide that works across the entire spectrum of possible providers requires not only that the user's preferences and other characteristics (educational level, interest, profession, age, visual acuity) be specified in a standard, vendor-independent manner -- obviously a job for an industry-standard markup system -- but also that the programs themselves be described in a way that allows agents to intelligently select the ones most likely to be of interest to the user. This second requirement can be met only by a standardized system that uses many specialized tags to convey specific attributes of a particular program offering (subject category, audience category, leading actors, length, date made, critical rating, specialized content, language, etc.). Exactly the same requirements would apply to customized newspapers and many other applications in which information selection is tailored to the individual user.
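
A very small agent of this kind might be sketched as follows. Both vocabularies here, the program listing and the viewer profile, are invented for the example and do not correspond to any existing standard; the selection rule (matching language, a listed interest, and a maximum length) is likewise just one assumed policy.

```python
import xml.etree.ElementTree as ET

# Hypothetical program guide; attribute names are invented for illustration.
GUIDE = """
<guide>
  <program title="Deep Sea Life" category="nature" language="en" length="60"/>
  <program title="Le Journal" category="news" language="fr" length="30"/>
  <program title="Chip Design Today" category="engineering" language="en" length="45"/>
</guide>
""".strip()

# Hypothetical viewer profile, also an invented vocabulary.
PROFILE = """
<viewer>
  <language>en</language>
  <interest>engineering</interest>
  <interest>nature</interest>
  <max-length>50</max-length>
</viewer>
""".strip()


def select(guide, profile):
    """Return titles whose language, category, and length fit the profile."""
    language = profile.findtext("language")
    interests = {i.text for i in profile.findall("interest")}
    max_length = int(profile.findtext("max-length"))
    return [
        p.get("title")
        for p in guide.findall("program")
        if p.get("language") == language
        and p.get("category") in interests
        and int(p.get("length")) <= max_length
    ]


picks = select(ET.fromstring(GUIDE), ET.fromstring(PROFILE))
print(picks)
```

The agent can make this selection only because both the programs and the viewer are described with explicit, named attributes rather than free text.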

While such applications still lie over the horizon, it is obvious that they will play an increasingly important role in our lives and that their implementation will require XML-like data in order to function interoperably and thereby allow intelligent Web agents to compete effectively in an open market.

Advanced linking and stylesheet mechanisms

Outside XML as such, but an integral part of the W3C SGML effort, are powerful linking and stylesheet mechanisms that go beyond current HTML-based methods just as XML goes beyond HTML.

Linking

Despite its name and all of the publicity that has surrounded HTML, this so-called "hypertext markup language" actually implements just a tiny amount of the functionality that has historically been associated with the concept of hypertext systems. Only the simplest form of linking is supported -- unidirectional links to hardcoded locations. This is a far cry from the systems that were built and proven during the 1970s and 1980s.

In a true hypertext system of the kind envisioned for the XML effort, there will be standardized syntax for all of the classic hypertext linking mechanisms:

Location-independent naming
Bidirectional links
Links that can be specified and managed outside of documents to which they apply
N-ary hyperlinks (e.g., rings, multiple windows)
Aggregate links (multiple sources)
Transclusion (the link target document appears to be part of the link source document)
Attributes on links (link types)
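
Three of these mechanisms (links managed outside the documents, bidirectional traversal, and typed links) can be illustrated with a simple data structure. The sketch below is an assumption about how such a link base might be modeled in Python; it does not reflect the syntax of the forthcoming linking specification, and the document names used are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One out-of-line link; names are location-independent identifiers."""
    source: str
    target: str
    link_type: str


class LinkBase:
    """Links stored apart from the documents and indexed in both directions."""

    def __init__(self, links):
        self._outgoing = defaultdict(list)
        self._incoming = defaultdict(list)
        for link in links:
            self._outgoing[link.source].append(link)
            self._incoming[link.target].append(link)

    def links_from(self, name):
        return [(l.target, l.link_type) for l in self._outgoing.get(name, [])]

    def links_to(self, name):
        return [(l.source, l.link_type) for l in self._incoming.get(name, [])]


# Hypothetical names; neither document needs to contain any link markup.
link_base = LinkBase([
    Link("manual:install", "manual:glossary", "definition"),
    Link("faq:q7", "manual:install", "see-also"),
])

from_install = link_base.links_from("manual:install")
to_install = link_base.links_to("manual:install")

print(from_install, to_install)
```

Because the links live in a separate table, a reader of manual:install can discover that faq:q7 points to it, something a unidirectional HTML anchor embedded in the FAQ cannot offer.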

The first draft of a specification for basic standardized hypertext mechanisms to be used in conjunction with XML is scheduled for release at the Sixth World Wide Web Conference in April, 1997.

Stylesheets

The current CSS (cascading style sheets) effort provides a style mechanism well suited to the relatively low-level demands of HTML but incapable of supporting the greatly expanded range of rendering techniques made possible by extensible structured markup. The counterpart to XML is a stylesheet programming language that is:

Freely extensible so that stylesheet designers can define an unlimited number of treatments for an unlimited variety of tags.
Turing-complete so that stylesheet designers can arbitrarily extend the available procedures.
Based on a standard syntax to minimize the learning curve.
Able to address the entire tree structure of an XML document in structural terms, so that context relationships between elements in a document can be expressed to any level of complexity.
Completely internationalized so that left-to-right, right-to-left, and top-to-bottom scripts can all be dealt with, even if mixed in a single document.
Provided with a sophisticated rendering model that allows the specification of professional page layout features such as multiple column sets, rotated text areas, and float zones.
Defined in a way that allows partial rendering in order to enable efficient delivery of documents over the Web.
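
The requirement that a stylesheet be able to address elements by their structural context can be illustrated with a toy example. The Python sketch below is not DSSSL and borrows nothing from its syntax; the element names, the rule table, and the "longest matching path suffix wins" policy are all assumptions chosen to show how the same tag can receive different treatment depending on where it sits in the tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element names are invented for illustration.
DOC = """
<book>
  <title>Guide</title>
  <chapter>
    <title>Setup</title>
    <section><title>Requirements</title></section>
  </chapter>
</book>
""".strip()

# Hypothetical rules: (path suffix, style). The longest matching suffix wins.
RULES = [
    (("title",), "12pt"),
    (("chapter", "title"), "18pt bold"),
    (("book", "title"), "24pt bold"),
]


def style_for(path):
    """Pick the style whose suffix matches the most ancestors of the element."""
    matches = [
        (len(suffix), style)
        for suffix, style in RULES
        if path[-len(suffix):] == suffix
    ]
    return max(matches, key=lambda m: m[0])[1] if matches else None


def styled_titles(element, path=()):
    """Walk the tree, pairing each <title> text with its context-dependent style."""
    path = path + (element.tag,)
    found = []
    if element.tag == "title":
        found.append((element.text, style_for(path)))
    for child in element:
        found.extend(styled_titles(child, path))
    return found


result = styled_titles(ET.fromstring(DOC))
print(result)
```

A fixed, tag-by-tag style mapping could not distinguish these three titles; a language that can inspect ancestors, siblings, and position can express such distinctions to any depth.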

Such a language already exists in a new international standard called the Document Style Semantics and Specification Language (DSSSL, ISO/IEC 10179). Published in April, 1996, DSSSL is the stylesheet language of the future for XML documents. An initial specification of a DSSSL subset [3] for use with XML applications has already been published. This specification will be further developed as part of the XML activity.

Conclusion

HTML functions well as a markup for the publication of simple documents and as a transportation envelope for downloadable scripts. However, the need to support the much greater information requirements of standardized Java applications will necessitate the development of a standard, extensible, structured language and similarly expanded linking and stylesheet mechanisms. The W3C SGML effort is actively developing a set of specifications that will allow these objectives to be met within an open standards environment.

Acknowledgements

The author would like to thank his colleagues in the Davenport Group for early contributions to the beginnings of this document. The example applications were clarified and expanded with the help of participants in the workshop "Internet Applications of SGML and DSSSL" held at the GCA Information and Technology Week in Seattle on August 23, 1996. Special thanks are due to Tim Bray, Kurt Conrad, Steve DeRose, Matt Fuchs, and Murray Maloney for their outstanding contributions to the workshop.

Production note

This paper was written in HTML 3.2 and formatted by the Jade DSSSL engine [4] for printout. The section numbers, headers, footers, and Table of Contents seen in the printed version are not part of the HTML source [5] but were generated automatically as specified by a DSSSL stylesheet [6].

References

[1] http://www.w3.org/pub/WWW/MarkUp/SGML/Activity
[2] http://www.w3.org/pub/WWW/TR/WD-xml-961114.html
[3] http://sunsite.unc.edu/pub/sun-info/standards/dsssl/dssslo/dssslo.htm
[4] http://www.jclark.com/jade/
[5] http://sunsite.unc.edu/pub/sun-info/standards/xml/why/xmlapps.htm
[6] http://sunsite.unc.edu/pub/sun-info/standards/dsssl/stylesheets/html32/html32hc.dsl
