Copyright © 2003 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
This document describes fundamental requirements for the specifications under development in the W3C Multimodal Interaction Activity. These requirements were derived from use case studies as discussed in Appendix A. They have been developed for use by the Multimodal Interaction Working Group (W3C Members only), but may also be relevant to other W3C working groups and related external standards activities.
The requirements cover general issues, inputs, outputs, architecture, integration, synchronization points, runtimes and deployments, but this document does not address application or deployment conformance rules.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the W3C.
W3C's Multimodal Interaction Activity is developing specifications for extending the Web to support multiple modes of interaction. This document describes fundamental requirements for multimodal interaction.
This document has been produced as part of the W3C Multimodal Interaction Activity, following the procedures set out for the W3C Process. The authors of this document are members of the Multimodal Interaction Working Group (W3C Members only). This is a Royalty Free Working Group, as described in W3C's Current Patent Practice NOTE. Working Group participants are required to provide patent disclosures.
Please send comments about this document to the public mailing list: www-multimodal@w3.org (public archives). To subscribe, send an email to <www-multimodal-request@w3.org> with the word subscribe in the subject line (include the word unsubscribe if you want to unsubscribe).
A list of current W3C Recommendations and other technical documents, including Working Drafts and Notes, can be found at http://www.w3.org/TR/.
Multimodal interactions extend the Web user interface to allow multiple modes of interaction, offering users the choice of using their voice, or an input device such as a keypad, keyboard, mouse or stylus. For output, users will be able to listen to spoken prompts and audio, and to view information on graphical displays. This capability for the user to specify the mode or device for a particular interaction in a particular situation is expected to significantly improve the user interface, its accessibility and reliability, especially for mobile applications. The W3C Multimodal Interaction Working Group (WG) is developing markup specifications for authoring applications synchronized across multiple modalities or devices with a wide range of capabilities.
This document is an internal working draft prepared as part of the discussions on requirements for multimodal interaction specifications.
The work on the present requirements document started from the Multimodal Requirements for Voice Markup Languages public working draft (version 1.0) published by the W3C Voice Browser Activity [MM Req Voice]. The outline of the document remains very similar.
The present requirements scope the nature of the work and specifications that will be developed by the W3C Multimodal Interaction Working Group (as specified by the charter [MMI Charter]). These intended works may be referred to below as "specification(s)".
The requirements in this document do not express conformance rules on application, platform runtime implementation or deployment.
In this document, the following conventions have been followed when phrasing the requirements:
It is not required that a particular specification produced by the W3C MMI Working Group address all the requirements in this document. It is possible that the requirements are addressed by different specifications and that all the "MUST specify" requirements are only satisfied by combining the different specifications produced by the W3C Multimodal Interaction Working Group. However, in such a case, it should be possible to clearly indicate which specification will address which requirements.
To lay the groundwork for the technical requirements, we first discuss an intended frame of reference for a multimodal system, introducing various concepts and terms that will be referred to in the normative sections below. For the reader's convenience, we have collected the concepts and terms introduced in this frame of reference in the glossary.
We are interested in defining the requirements for the design of multimodal systems: systems that support a user communicating with an application by using different modalities such as voice (in a human language), gesture, handwriting, typing, audio-visual speech, etc. The user may be considered to be operating in a delivery context: a term used to specify the set of attributes that characterizes the capabilities of the access mechanism in terms of device profile, user profile (e.g. identity, preferences and usage patterns) and situation. The user interacts with the application in the context of a session, using one or more modalities (which may be realized through one or more devices). Within a session, the user may suspend and resume interaction with the application within the same modality or switch modalities. A session is associated with a context, which records the interactions with the user.
In multimodal systems, an event is a representation of some asynchronous occurrence of interest to the multimodal system. Examples include mouse clicks, hanging up the phone, and speech recognition results or errors. Events may be associated with information about the user interaction, e.g. the location where the mouse was clicked. A typical event source is a user; such events are called input events. An external input event is one not generated by a user, e.g. a GPS signal. The multimodal system may also produce external output events for external systems (e.g. a logging system). In order to preserve temporal ordering, events may be time stamped. Typically, events are formalized as generated by event sources and associated with event handlers, which subscribe to the event and are notified of its occurrence. This is exemplified by the XML Events model.
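As a minimal, hedged illustration of this model (the element identifiers and handler reference below are invented for the example), an XML Events listener declaratively binds an event source to a handler:

<!-- Illustrative XML Events declaration: DOM "click" events raised on the
     element with id "submitButton" are dispatched to the handler element
     identified by "#confirmHandler"; both identifiers are hypothetical. -->
<listener xmlns="http://www.w3.org/2001/xml-events"
          event="click"
          observer="submitButton"
          handler="#confirmHandler"/>

Because the observer, the event type and the handler are all expressed declaratively, the same pattern can be reused for modality-specific events without scripting against a particular DOM implementation.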
The user typically provides input in one or more modalities and receives output in one or more modalities. Input may be classified as sequential, simultaneous or composite. Sequential input is input received on a single modality, though that modality can change over time. Simultaneous input is input received on multiple modalities and treated separately by downstream processes (such as interpretation). Composite input is input received on multiple modalities at the same time and treated as a single, integrated "composite" input by downstream processes. Inputs are combined using the coordination capability of the multimodal system, typically driven by input constraints or decided by the interaction manager.
Input is typically subject to input processing. For instance, speech input may be passed to a speech recognition engine, including, for instance, semantic interpretation in order to extract meaningful information (e.g. a semantic representation) for downstream processing. Note that simultaneous and composite input may be conflicting, in that the interpretations of the input may not be consistent (e.g. the user says "yes" but clicks on "no").
Two fundamentally different uses of multimodality may be identified: supplementary multimodality and complementary multimodality. An application makes supplementary use of multimodality if it allows every interaction (input or output) to be carried through to completion in each modality as if it were the only available modality. Such an application enables the user to select at any time the modality that is best suited to the nature of the interaction and the user's situation. Conversely, an application makes complementary use of multimodality if interactions in one modality are used to complement interactions in another. (For instance, the application may visually display several options in a form and aurally prompt the user "Choose the city to fly to".) Complementary use may help a particular class of users (e.g. those with dyslexia). Note that in an application supporting complementary use of different modalities, each interaction may not be accessible separately in each modality. Therefore it may not be possible for the user to determine which modality to use. Instead, the document author may prescribe the modality (or modalities) to be used in a particular interaction.
The synchronization behavior of an application describes the way in which any input in one modality is reflected in the output in another modality, as well as the way input is combined across modalities (coordination capability). The synchronization granularity specifies the level at which the application coordinates interactions. The application is said to exhibit event-level synchronization if user inputs in one modality are captured at the level of individual DOM events and immediately reflected in the other modality. The application exhibits field-level synchronization if inputs in one modality are reflected in the other after the user changes focus (e.g. moves from input field to input field) or completes the interaction (e.g. completes a selection in a menu). The application exhibits form-level synchronization if inputs in one modality are reflected in the other only after a particular point in the presentation is reached (e.g. after a certain number of fields have been completed in the form).
The output generated by a multimodal system can take various forms, e.g. audio (including spoken prompts and playback, e.g. using natural language generation and text-to-speech (TTS), which synthesizes audio), visual (e.g. XHTML or SVG markup rendered on displays), lip synchronization (multimedia output in which there is a visual rendition of a face whose lip movements are synchronized with the audio), etc. Of relevance here is the W3C Recommendation SMIL 2.0, which enables simple authoring of interactive audiovisual applications and supports media synchronization.
Interaction (input, output) between the user and the application may often be conceptualized as a series of dialogs, managed by an interaction manager. A dialog is an interaction between the user and the application which involves turn taking. In each turn, the interaction manager (working on behalf of the application) collects input from the user, processes it (using the session context and possibly external knowledge sources) to determine the intent of the user, computes a response and updates the presentation for the user. An interaction manager relies on strategies to determine focus and intent as well as to disambiguate, correct and confirm sub-dialogs. We typically distinguish directed dialogs (e.g. user-driven or application-driven) and mixed initiative or free flow dialogs.
The interaction manager may use (1) inputs from the user, (2) the session context, (3) external knowledge sources, and (4) disambiguation, correction, and confirmation sub-dialogs to determine the user's focus and intent. Based on the user's focus and intent, the interaction manager also (1) maintains the context and state of the application, (2) manages the composition of inputs and synchronization across modalities, (3) interfaces with business logic, and (4) produces output for presentation to the user. In some architectures, the interaction manager may have distributed components, utilizing an event-based mechanism for coordination.
Finally, in this document, we use the term configuration or execution model to refer to the runtime structure of the various system components and their interconnection, in a particular manifestation of a multimodal system.
It is the intent of the WG to define specifications that apply to a variety of multimodal capabilities and deployment conditions.
(MMI-G1): The multimodal specifications MUST support authoring multimodal applications for a wide range of multimodal capabilities (MUST specify).
The specifications should support different combinations of input and output modalities, synchronization granularity, configurations and devices. Some aspects of this requirement are elaborated in detail below. For instance, the range of synchronization granularities is addressed by requirement MMI-A6.
It is advantageous that the specifications allow the application developer to author a single version of the application, instead of multiple versions targeted at combinations of multimodal capabilities.
(MMI-G2): The multimodal specifications SHOULD support authoring multimodal applications once for deployment on different devices with different multimodal capabilities (NICE to specify).
The multimodal capabilities may differ based on available modalities, presentation and interaction capability for each modality (modality-specific delivery context), synchronization granularity, available devices and their configurations, etc. These are captured in the delivery context associated with the multimodal system.
(MMI-G3): The multimodal specifications MUST support supplementary use of modalities (MUST specify).
Supplementary use of modalities in multimodal applications significantly improves the accessibility of the applications. The user may select the modality best suited to the nature of the interaction and the context of use.
When supported by the runtime or prescribed by the author, it may be possible for the user to combine modalities, as discussed for example in requirement MMI-I7 about composite input.
(MMI-G4): The multimodal specifications MUST support complementary use of modalities (MUST specify).
Authors of multimodal applications that rely on complementary multimodality should pay special attention to the accessibility of the application, for example by ensuring accessibility in each modality or by providing supplementary alternatives.
(MMI-G5): The multimodal specifications will be designed such that an author can write applications where the synchronization of the various modalities is seamless from the user's point of view (MUST specify).
To elaborate, an interaction event or an external event in one modality results in a change in another, based on the synchronization granularity supported by the application. See section 4.5 for a discussion of synchronization granularities.
Seamlessness can encompass multiple aspects:
Limited latency in the synchronization behavior with respect to what is expected by the user for the particular application and multimodal capabilities.
Predictable, non-confusing multimodal behavior.
Expanding on the considerations made in section 1.1, it is important to support authoring for any granularity of synchronization covered in (MMI-A6):
(MMI-G6): The multimodal specifications MUST support authoring seamless synchronization of various modalities for any synchronization granularity (MUST specify).
Coordination is defined as the capability to combine multimodal inputs into composite inputs based on an interpretation algorithm that decides what makes sense to combine based on the context. Composite inputs are further discussed in section 2.4. It is a notion different from the synchronization granularity described in section 4.5.
The following requirement is proposed in order to address the combinatorial explosion of synchronization granularities that the application developer must author for.
(MMI-G7): The multimodal specifications SHOULD support authoring seamless synchronization of various modalities once for deployment across a whole range of synchronization granularities or coordination capabilities (NICE to specify).
This requirement addresses the capability for the application developer to write the application once for a particular synchronization granularity or coordination capability and to have the application able to adapt its synchronization behavior when other levels are available.
Multimodal applications are no different from any other web applications. It is important that the specifications not be limited to specific human languages.
(MMI-G8): The multimodal specifications MUST support authoring multimodal applications in any human language (MUST specify).
In particular, it must be possible to apply conventional methods for localization and internationalization of applications.
(MMI-G9): The multimodal specifications MUST not preclude the capability to move a multimodal application from one human language to another without having to rewrite the whole application (MUST specify).
For example, it should be possible to encapsulate language-specific items separately from the language-independent description.
It is important that multimodal applications remain easy to author and deploy in order to allow wide adoption by the web community.
(MMI-G10): The multimodal specifications produced by the MMI Working Group MUST be easy to implement and use (MUST specify).
This is a generic requirement that requires designers to consider from the outset issues of ease of authoring by application developers, ease of implementation by platform developers and ease of use by the user. Thus it affects authoring, platform implementation and deployment.
The following requirement qualifies this further to guarantee that the specifications will be widely deployable with existing technologies (e.g. standards, network and client capabilities, etc.).
(MMI-G11): The multimodal specifications produced by the MMI Working Group MUST depend only on technologies that are widely available during the lifetime of the working group (MUST specify).
For W3C specifications, wide availability is understood as having reached at least the stage of Candidate Recommendation.
Related considerations are made in section 4.1.
The multimodal specifications will provide mechanisms to develop and deploy accessible applications, as discussed in section 1.2.
In addition, it is important that, as for all other web applications, the following requirement be satisfied:
(MMI-G12): The multimodal specifications produced by the MMI Working Group MUST not preclude conforming to the W3C accessibility guidelines (MUST specify).
This is especially important for applications that make complementary use of modalities.
Early deployments of multimodal applications show that security and privacy issues can be very critical for multimodal deployments. While addressing these issues is not directly within the scope of the W3C Multimodal Interaction Working Group, it is important that these issues be considered.
(MMI-G13): The multimodal specifications SHOULD be aligned with the W3C work and specifications for security and privacy (SHOULD specify).
The following security and privacy considerations have been identified so far:
Other considerations and issues may exist and should becompiled.
Notions of profile and delivery context have been widely introduced to characterize the capabilities of devices and the preferences of users.
From a multimodal point of view, different types of profiles are relevant:
These profiles are combined into the notion of delivery context introduced by the W3C Device Independence Activity [DI Activity]. The delivery context captures the set of attributes that characterize the capabilities of the access mechanism (device or devices) (device profile), the dynamic preferences of the user (as they relate to interaction through this device) and configurations. The delivery context may dynamically change as the application progresses, as the user's situation changes (situationalization) or as the number and configurations of the devices change.
CC/PP is an example of a formalism to describe and exchange the delivery context [CC/PP].
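As a hedged sketch of what such a delivery context description could look like (the profile URIs, the "ex:" vocabulary and the attribute values are illustrative placeholders rather than a normative CC/PP vocabulary, and the ccpp namespace shown follows the CC/PP structure drafts), an RDF fragment describing one hardware component might be:

<?xml version="1.0"?>
<!-- Illustrative CC/PP-style fragment: one hardware component of a delivery
     context, listing display size and available input modalities. The ex:
     properties are placeholders for an agreed vocabulary. -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ccpp="http://www.w3.org/2002/11/08-ccpp-schema#"
         xmlns:ex="http://example.org/deliveryContext#">
  <rdf:Description rdf:about="http://example.org/profile#MyDevice">
    <ccpp:component>
      <rdf:Description rdf:about="http://example.org/profile#TerminalHardware">
        <ex:displayWidth>320</ex:displayWidth>
        <ex:displayHeight>240</ex:displayHeight>
        <ex:audioInput>true</ex:audioInput>
        <ex:penInput>true</ex:penInput>
      </rdf:Description>
    </ccpp:component>
  </rdf:Description>
</rdf:RDF>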
Users of multimodal interactions will expect to be able to rely on these profiles to optimize the way that multimodal applications are presented to them.
(MMI-G14): The multimodal specifications MUST enable optimization and adaptation of multimodal applications based on the delivery context or dynamic changes of the delivery context (MUST specify).
Dynamic changes of the delivery context encompass situations where the available devices, modalities and configurations, or usage preferences, change dynamically. These changes can be involuntary or initiated by the user, the application developer or the service provider.
(MMI-G15): The multimodal specifications MUST enable authors to specify how the delivery context and changes of the delivery context affect the multimodal interface of a particular application (MUST specify).
The description of such impacts on a multimodal application could be specified by the author but modified by the user, platform vendor or service provider. In particular, the author can describe how the application can be affected or adapted to the delivery context, but the user and service providers should be able to modify the delivery context. Other use cases should also be considered.
It is expected that the author of a multimodal application should always be able to specify the expected flow of navigation (i.e. sequence of interaction) through the application or the algorithm to determine such a flow (e.g. in mixed initiative cases). This leads to the following requirement:
(MMI-G16): The multimodal specifications MUST enable the author of an application to describe the navigation flow through the application or indicate the algorithms used to determine the navigation flow (MUST specify).
Numerous modalities or input types require some form of processing before the nature of the input is identified. For instance, speech input requires speech detection and speech recognition, which requires specific data files (e.g. grammars, language models, etc.). Similarly, handwritten input requires recognition.
(MMI-I1): The multimodal specifications MUST provide a mechanism to specify and attach modality-related information when authoring a multimodal application (MUST specify).
This implies that authors should be able to include modality-related information, such as the media types, processing requirements or fallback mechanisms that a user agent will need for the particular modality. Mechanisms should be available to make this information available to the user agent.
For example, audio input may be recognized (speech recognizer), recorded, or processed by speaker recognition or natural language processing, using specific data files (e.g. a grammar or language model), etc. The author must be able to completely define such processing steps.
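A speech grammar is a typical example of such a data file. The fragment below is a minimal sketch using the XML form of the W3C Speech Recognition Grammar Specification (SRGS); the rule name and the list of cities are purely illustrative:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal illustrative SRGS grammar: constrains a spoken "destination"
     input to three cities. A real application would reference a much
     larger, generated data file. -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="destination">
  <rule id="destination" scope="public">
    <one-of>
      <item>Boston</item>
      <item>Paris</item>
      <item>Tokyo</item>
    </one-of>
  </rule>
</grammar>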
(MMI-I2): The multimodal specifications developed by the MMI Working Group MUST support sequential multimodal input (MUST specify).
It implies that
(MMI-I3): The multimodal specifications developed by the MMI Working Group MUST support simultaneous multimodal input (MUST specify).
(MMI-I4): The multimodal specifications MUST enable the author to specify the granularity of input synchronization (MUST specify).
It should be remarked, however, that the actual granularity of input synchronization may be decided by the user, by the runtime or by the network (delivery context), or some combination thereof.
(MMI-I5): The multimodal specifications MUST enable the author to specify how the multimodal application evolves when the granularity of input synchronization is modified by external factors (MUST specify).
This requirement enables the application developer to specify how the performance of the application can degrade gracefully with changes in the input mechanism. For instance, it should be possible to access an application designed for event-level or field-level synchronization between voice (on the server side) and GUI (on the terminal) on a network that permits only session-level synchronization (that is, permits only sequential multimodality).
(MMI-I6): The multimodal specifications SHOULD enable a default input synchronization behavior and provide "overwrite" mechanisms (SHOULD specify).
Therefore, it should be possible to author multimodal applications while assuming a default synchronization behavior, for example supplementary event-level multimodal synchronization granularity.
(MMI-I7): The multimodal specifications developed by the MMI Working Group MUST support composite multimodal input (MUST specify).
(MMI-I8): The multimodal specifications SHOULD allow the author to specify how input combination is achieved, possibly taking into account the coordination capabilities available in the given delivery context (NICE to specify).
This can be achieved with explicit scripts that describe the interpretation and composition algorithms. On the other hand, it may also be left to the interaction manager to apply an interpretation strategy that includes composition, for example by determining the most sensible interpretation given the session context and therefore determining what input combination (if any) to select. This is addressed by the following requirement.
(MMI-I9): The multimodal specifications SHOULD enable the author to specify the mechanism used to decide when coordinated inputs are to be combined and how they are combined (NICE to specify).
Possible ways to address this include:
(MMI-I10): The multimodal specifications MUST support the description of input to be obtained from:
(MUST specify).
(MMI-I11): The multimodal specifications SHOULD support other input modes, including:
(NICE to specify).
(MMI-I12): The multimodal specifications MUST describe how extensibility is to be achieved and how new devices or modalities can be added (MUST specify).
(MMI-I13): The multimodal specifications MUST support the representation of the meaning of a user input (MUST specify).
(MMI-I16): The multimodal specifications MUST enable the coordination of input constraints across modalities (MUST specify).
Input constraints specify, for example through grammars, how inputs can be combined via rules or interaction management strategies. For example, the markup language may coordinate grammars for modalities other than speech with speech grammars to avoid duplication of effort in authoring multimodal grammars.
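As a purely hypothetical authoring sketch (the constraints, field, speech-input and select elements below are invented for illustration and do not correspond to any existing W3C markup), such coordination could allow one declaration of the legal values to drive both the speech grammar and the graphical menu for the same field:

<!-- Hypothetical markup: the "cities" constraints are declared once and
     reused by the speech modality (to generate a grammar) and by the GUI
     modality (to populate a menu) for the same "destination" field. -->
<constraints id="cities">
  <item value="BOS">Boston</item>
  <item value="CDG">Paris</item>
  <item value="HND">Tokyo</item>
</constraints>

<field name="destination">
  <speech-input grammar-from="#cities"/>
  <select options-from="#cities"/>
</field>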
Possible ways to address this could include:
These methods will be considered during the specification work.
When using multiple modalities or user agents, a user may introduce errors, consciously or inadvertently. For example, in a voice and GUI multimodal application, the user may say "yes" while simultaneously clicking on "no" in the user interface. We require that the specifications support the detection of such conflicts.
(MMI-I17): The multimodal specifications MUST support the detection of conflicting input from several modalities (MUST specify).
It is naturally expected that the author will specify how to handle the conflict through an explicit script or piece of code. It is also possible that an interaction management strategy will be able to detect the possible conflict and provide a strategy or sub-dialog to resolve it.
The interaction manager should be able to place different input events on a timeline in order to determine the intent of the user.
(MMI-I18): The multimodal specifications MUST provide mechanisms to position input events relative to each other in time (MUST specify).
(MMI-I19): The multimodal specifications SHOULD provide mechanisms to allow for temporal grouping of input events (SHOULD specify).
These requirements may be satisfied by mechanisms to order the input events or, when needed, by relative time stamping. For some configurations, this may involve clock synchronization.
(MMI-O1): The multimodal specifications developed by the MMI Working Group MUST support sequential media output (MUST specify).
As SMIL supports the sequencing of media, the specifications are expected to rely on a similar mechanism. This is addressed in more detail in other requirements.
It implies that
(MMI-O2): The multimodal specifications MUST provide the ability to synchronize different output media with different granularities (MUST specify).
This covers simultaneous outputs. The granularity of output synchronization, as provided by SMIL, may range from no synchronization at all between the media (other than playing them in parallel) to tight synchronization mechanisms.
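As a minimal, hedged sketch of these two ends of the range (the media file names are illustrative), SMIL 2.0 expresses sequential output with seq and parallel output with par:

<!-- Illustrative SMIL 2.0 fragment: a welcome prompt is played first, then
     a map image and its spoken description are rendered in parallel. -->
<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <body>
    <seq>
      <audio src="welcome-prompt.wav"/>
      <par>
        <img src="route-map.png" dur="10s"/>
        <audio src="route-description.wav"/>
      </par>
    </seq>
  </body>
</smil>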
(MMI-O3): The multimodal specifications MUST enable the author to specify the granularity of output synchronization (MUST specify).
However, it should be possible for the granularity of output media synchronization to be decided by the user, runtime or network (delivery context).
(MMI-O4): The multimodal markup MUST enable the author to specify how the multimodal application degrades when the granularity of output synchronization is modified by external factors (MUST specify).
(MMI-O5): The multimodal specifications SHOULD rely on a default output synchronization behavior for a particular granularity and SHOULD provide "overwrite" mechanisms (SHOULD specify).
(MMI-O6): The multimodal specifications MUST support as output media:
(MUST specify).
(MMI-O7): The multimodal specifications SHOULD support additional media outputs like:
(NICE to specify).
(MMI-O8): The multimodal specifications MUST describe how extensibility is to be achieved and how new output media can be added (MUST specify).
(MMI-O9): The multimodal specifications MUST support the specification of which output media should be processed and how this should be done. The specifications MUST provide a mechanism that describes how this can be achieved or extended for different modalities (MUST specify).
Examples of output processing may include: adaptation or styling of presentation for particular modalities, speech synthesis of text output into audio output, natural language generation, etc.
(MMI-A1): Where the functionality is appropriate, and clean integration is possible, the multimodal specifications MUST enable the use and integration of existing standard language specifications, including visual, aural, voice and multimedia standards (MUST specify).
In general, it is understood that in order to satisfy MMI-G11, dependencies of the multimodal specifications on other specifications must be carefully evaluated if these are not yet W3C Recommendations or not yet widely adopted.
SMIL 2.0 provides multimedia synchronization mechanisms. Therefore, MMI-A1 implies:
(MMI-A1a): The multimodal specifications MUST enable the synchronization of input and output media through SMIL 2.0 as a control mechanism (MUST specify).
The following requirement results from MMI-A1.
(MMI-A2): The multimodal specifications MUST be expressible in terms of XHTML modularization (MUST specify).
(MMI-A3): The multimodal specifications MUST allow the separation of data model, presentation layer and application logic in the following ways:
(MUST specify).
This will enable the multimodal specifications to be compatible with XForms in environments which support XForms. This would comply with MMI-A1.
From an authoring point of view, it is important to have mechanisms (events, protocols, handlers) to detect or prescribe the modalities that are or should be available, i.e. to check the delivery context and to adapt to the delivery context. This is covered by MMI-G14 and MMI-G15.
(MMI-A4): There MUST be events associated with changes of the delivery context and mechanisms to specify how to handle these events by adapting the multimodal application (MUST specify).
(MMI-A5): There SHOULD be mechanisms available to define the delivery context or behavior that is expected or recommended by the author (SHOULD specify).
(MMI-A6): The multimodal specifications MUST support synchronization granularities at the following levels of synchronization:
(MUST specify).
In addition,
The following requirement results from MMI-A1.
(MMI-A7a): Event-level synchronization MUST follow the DOM event model (MUST specify).
(MMI-A7b): Event-level synchronization SHOULD follow XML Events (SHOULD specify).
Such events are not limited to events generated by user interactions, as discussed in MMI-A16.
It is important that the application developer be able to fully define the synchronization granularity.
(MMI-A8): The multimodal specifications MUST enable the author to specify the granularity of synchronization (MUST specify).
However:
(MMI-A9): It MUST be possible for the granularity of synchronization to be decided by the user, runtime or network (through the delivery context) (MUST specify).
(MMI-A10): The multimodal specifications MUST enable the author to specify how the multimodal application degrades when the granularity of synchronization is modified by external factors (MUST specify).
(MMI-A11): The multimodal specifications SHOULD rely on a default input and output synchronization behavior and SHOULD provide "overwrite" mechanisms (SHOULD specify).
Nothing imposes that input and output, even in the same modality, be provided by the same device or user agent. The input and output can be independent, and the granularity of interfaces afforded by the specifications should apply independently to the mechanisms of input and output within a given modality when necessary.
(MMI-A12): The specifications MUST support separate interfaces for input and output, even within the same modality (MUST specify).
(MMI-A13): The multimodal specifications MUST support synchronization of different modalities or devices distributed across the network, providing the user with the capability to interact through different devices (MUST specify).
In particular, this includes multi-device applications where different devices or user agents are used to interact with the same application; these may involve presentation in the same modality but on different devices.
Distribution of input and output processing refers to cases where the processing algorithms applied to input and output may be performed by distributed components.
(MMI-A14): The multimodal specifications MUST support the distribution of input and output processing (MUST specify).
(MMI-A15): The multimodal specifications MUST support the expression of some level of control over the distributed processing of input and output (MUST specify).
This requirement is related to MMI-I1 and MMI-O9.
(MMI-A16): The multimodal specifications MUST enable the author to specify how multimodal applications handle external input events and generate external output events used by other processes (MUST specify).
Examples of input events include camera, sensor or GPS events. Examples of output events include any form of notification or trigger generated by the user interaction.
This is expected to be automatically satisfied if events are treated as XML Events.
Requirements MMI-I18 and MMI-I19 generalize as follows.
(MMI-A17): The multimodal specifications MUST provide mechanisms to position input and output events relative to each other in time (MUST specify).
(MMI-A18): The multimodal specifications SHOULD provide mechanisms to allow for temporal grouping of input and output events (SHOULD specify).
These requirements may be satisfied by mechanisms to order the events or, when needed, by relative time stamping. For some configurations, this may involve clock synchronization.
It is expected that users will interact with multimodal applications through different deployment configurations (i.e. architectures): the different modules responsible for media rendering, input capture, processing, synchronization, interpretation, etc., may be partitioned or combined on a single device or distributed across several devices or servers. As previously discussed, these configurations may dynamically change.
The specification of such configurations is beyond the scope of the W3C Multimodal Interaction Working Group. However:
(MMI-C1): The multimodal specifications MUST support the deployment of multimodal applications authored according to the W3C MMI specifications with all the relevant deployment configurations, where functions are partitioned or combined on a single engine or distributed across several devices or servers (MUST specify).
The possibility of interacting with multiple devices leads naturally to multi-user access to applications.
(MMI-C2): The multimodal specifications SHOULD support multi-user deployments (NICE to specify).
Multimodal interactions are especially important for mobile deployments. Therefore, the W3C Multimodal Interaction Working Group will pay attention to the constraints associated with mobile deployments, and especially cell phones.
(MMI-R1): The multimodal specifications MUST be compatible with deployments based on user agents / renderers that run on mobile platforms (MUST specify).
Mobile platforms, like smart phones, are typically constrained in terms of processing power and available memory. It is expected that the multimodal specifications will take such constraints into account and be designed so that multimodal deployments are possible on smart phones.
In addition, it is important to pay attention to the challenges introduced by mobile networks, such as limited bandwidth, delays, etc.:
(MMI-R2): The multimodal specifications MUST support deployments over mobile networks, considering the bandwidth limitations and delays that they may introduce (MUST specify).
This may enable deployment techniques or specifications from other standards activities to provision the necessary quality of service.
The following requirements apply to the objectives for the specification work on EMMA as defined in the glossary. EMMA is intended to support the necessary exchanges of information between the multimodal modules mentioned in section 5.1.
(MMI-E1): The multimodal specifications MUST support the generation, representation and exchange of input events and the results of input or output processing (MUST specify).
(MMI-E2): The multimodal specifications MUST support the generation, representation and exchange of interpretations and combinations of input events and the results of input or output processing (MUST specify).
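Since EMMA is still being defined, the fragment below is only a hedged sketch of the kind of annotated result such a representation could carry (the namespace, annotation names and values are illustrative, not normative): a single interpretation of the spoken input "fly to Boston", annotated with its medium, mode, confidence and time stamps so that downstream components can combine it with input from other modalities.

<!-- Illustrative EMMA-style fragment: the semantic interpretation of one
     spoken input, with annotations an interaction manager could use for
     composition; all names and values are examples only. -->
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="spoken-destination"
                       emma:medium="acoustic"
                       emma:mode="voice"
                       emma:confidence="0.87"
                       emma:start="1043868182000"
                       emma:end="1043868183200">
    <destination>Boston</destination>
  </emma:interpretation>
</emma:emma>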
(MMI-S1): The multimodal specifications MUST enable authoring of the generation of asynchronous events and their handlers (MUST specify).
(MMI-S2): The multimodal specifications MUST enable authoring of the generation of synchronous events and their handlers (MUST specify).
(MMI-S3): The multimodal specifications MUST support event handlers local to the event generator (MUST specify).
(MMI-S4): The multimodal specifications MUST support event handlers remote to the event generator (MUST specify).
(MMI-S5): The multimodal specifications MUST support the exchange of EMMA fragments as part of the synchronization event content (MUST specify).
(MMI-S6): The multimodal specifications MUST support the specification of event handlers for externally generated events (MUST specify).
(MMI-S7): The multimodal specifications MUST support the specification of event handlers for externally generated events that result from the interaction of the user (MUST specify).
(MMI-S8): The multimodal specifications MUST support handlers that manipulate or update the presentation associated with a particular modality (MUST specify).
In distributed configurations, it is important that synchronization exchanges take place with minimum delays. In practical deployments this implies that the highest available quality of service should be allocated to such exchanges.
(MMI-S9): The multimodal specifications MUST enable the identification of multimodal synchronization exchanges (MUST specify).
This would enable the underlying network to allocate the highest quality of service to synchronization exchanges, if it is aware of such needs. This network behavior is beyond the scope of the multimodal specifications.
(MMI-S10): The multimodal specifications MUST support confirmation of event handling (MUST specify).
(MMI-S11): The multimodal specifications MUST support event generation or event handling pending confirmation of a particular event handling (MUST specify).
(MMI-S12a): The multimodal specifications MUST be compatible with existing standards, including the DOM events and DOM specifications (MUST specify).
(MMI-S12b): The multimodal specifications SHOULD be compatible with existing standards, including the XML Events specification (SHOULD specify).
(MMI-S13): The multimodal specifications MUST allow lightweight multimodal synchronization exchanges compatible with wireless networks and mobile terminals (MUST specify).
This last requirement is derived from MMI-R1 and MMI-R2.
[CC/PP]: W3C CC/PP Working Group, URI: http://www.w3c.org/Mobile/CCPP/.
[DI Activity]: W3C Device Independence Activity, URI: http://www.w3c.org/2001/di/.
[MMI Charter]: W3C Multimodal Interaction Working Group Charter, URI: http://www.w3c.org/2002/01/multimodal-charter.html.
[MMI WG]: W3C Multimodal Interaction Working Group, URI: http://www.w3c.org/2002/mmi/.
[MM Req Voice]: Multimodal Requirements for Voice Markup Languages, W3C Working Draft, URI: http://www.w3c.org/TR/multimodal-reqs.
This section is informative.
This document was jointly prepared by the members of the W3C Multimodal Interaction Working Group.
Special acknowledgments to Jim Larson (Intel) and Emily Candell (Comverse) for their significant editorial contributions.
Analysis of use cases provides insight into the requirements for applications likely to require a multimodal infrastructure.
The use cases described below were selected for analysis in order to highlight different requirements resulting from application variations in areas such as device requirements, event handling, network dependencies and methods of user interaction.
Use Case Device Classification
Thin client
A device with little processing power and capability that can be used to capture user input (microphone, touch display, stylus, etc.) as well as non-user input such as GPS. The device may have a very limited capability to interpret the input, for example a small vocabulary speech recognizer or a character recognizer. The bulk of the processing occurs on the server, including natural language processing and interaction management.
An example of such a device may be a mobile phone with DSR capabilities and a visual browser (there could actually be thinner clients than this).
Fat client
A device with powerful processing capabilities, such that most of the processing can occur locally. Such a device is capable of input capture and interpretation. For example, the device may have a medium vocabulary speech recognizer, a handwriting recognizer, natural language processing and interaction management capabilities. The data itself may still be stored on the server.
An example of such a device may be a recent production PDA or an in-car system.
Medium client
A device capable of input capture and some degree of interpretation. The processing is distributed in a client/server or a multi-device architecture. For example, a medium client will have the voice recognition capabilities to handle small vocabulary command and control tasks but connects to a voice server for more advanced dialog tasks.
Use Case Summaries
Form Filling for air travel reservation
Description | Device Classification | Device Details | Execution Model |
The means for a user to reserve a flight using a wireless personal mobile device and a combination of input and output modalities. The dialog between the user and the application is directed through the use of a form-filling paradigm. | Thin and medium clients | touch-enabled display (i.e., supports pen input), voice input, local ASR and Distributed Speech Recognition Framework, local handwriting recognition, voice output, TTS, GPS, wireless connectivity, roaming between various networks. | Client Side Execution |
The user wants to make a flight reservation with his mobile device while he is on the way to work. The user initiates the service by means of making a phone call to a multimodal service (telephone metaphor) or by selecting an application (portal environment metaphor). The details are not described here.
As the user moves between networks with very different characteristics, the user is offered the flexibility to interact using the preferred and most appropriate modes for the situation. For example, while sitting in a train, the use of stylus and handwriting can achieve higher accuracy than speech (due to surrounding noise) and protect privacy. When the user is walking, the input and output modalities that are more appropriate would be voice with some visual output. Finally, at the office the user can use pen and voice in a synergistic way.
The dialog between the user and the application is driven by a form-filling paradigm where the user provides input to fields such as "Travel Origin:", "Travel Destination:", "Leaving on date", "Returning on date". As the user selects each field in the application to enter information, the corresponding input constraints are activated to drive the recognition and interpretation of the user input. The capability of providing composite multimodal input is also examined, where input from multiple modalities is combined for the interpretation of the user's intent.
Driving Directions
Description | Device Classification | Device Details | Execution Model |
This application provides a mechanism for a user to request and receive driving directions via speech and graphical input and output | Medium Client | on-board system (in a car) with a graphical display, map database, touch screen, voice and touch input, speech output, local ASR and TTS processing and GPS. | Client Side Execution |
The user wants to go to a specific address from his current location, and while driving wants to take a detour to a local restaurant (the user does not know the restaurant's address or name). The user initiates the service via a button on his steering wheel and interacts with the system via the touch screen and speech.
Name Dialing
Description | Device Classification | Device Details | Execution Model |
The means for users to call someone by saying their name. | thin and fat devices | Telephone | The study covers several possibilities:
These choices determine the kinds of events that are needed to coordinate the device and network based services. |
Janet presses a button on her multimodal phone and says one of the following commands:
The application initially looks for a match in Janet's personal contact list and, if no match is found, then proceeds to look in other directories. Directed dialog and tapered help are used to narrow down the search, using aural and visual prompts. Janet is able to respond by pressing buttons, by tapping with a stylus, or by using her voice.
Once a selection has been made, rules defined by Wendy are used to determine how the call should be handled. Janet may see a picture of Wendy along with a personalized message (aural and visual) that Wendy has left for her. Call handling may depend on the time of day, the location and status of both parties, and the relationship between them. An "ex" might be told to never call again, while Janet might be told that Wendy will be free in half an hour after Wendy's meeting has finished. The call may be automatically directed to Wendy's home, office or mobile phone, or Janet may be invited to leave a message.
The use-case analysis exercise helped to identify the types of events a multimodal system would likely need to support.
Based on the use case analysis, the following event classifications were defined:
The events from the use cases described above have been consolidated in the following table.
Event Table:
| Event Type | Asynchronous vs. Synchronous | Local vs. remote generation | Local vs. remote handling | Input interpretation | External vs. User | Notifications vs. actions | Comments |
1. | Data Reply Event | Synchronous | Remote | Local | No | External | Notification | Event containing results from a previous data request |
2. | HTTP Request | Asynchronous | Local | Remote | No | External | N/A | A request sent via the HTTP Protocol |
3. | GPS_DATA_in | Synchronous | Remote | Local | No | External | Notification | Event containing GPS Location Data |
4. | Touch Screen Event | Asynchronous | Local | Local | Yes | User | Action | Event that contains coordinates corresponding to a location on a touch screen |
5. | Start_Listening Event | Asynchronous | Local / Remote | Local / Remote | No | User | Action | Event to invoke the speech recognizer |
6. | Return Reco Results | Synchronous | Local / Remote | Local | Yes | External | Notification | Event containing the results of a recognition |
7. | Alert | Asynchronous | Remote | Local | No | External | Notification | Event containing unsolicited data which may be of use to an application |
8. | Register User Ack | Synchronous | Remote | Local | No | External | Notification | Event acknowledging that user has registered with the service |
9. | Call | Asynchronous | Local | Remote | No | User | Action | Request to place an outgoing call |
10. | Call Ack | Synchronous | Remote | Local | No | External | Notification | Event acknowledging request to place an outgoing call |
11. | Leave Message | Asynchronous | Local | Remote | No | User | Action | Request to leave a message |
12. | Message Ack | Synchronous | Remote | Local | No | External | Notification | Event acknowledging request to leave a message |
13. | Send Mail | Asynchronous | Local | Remote | No | User | Action | Request to send a message |
14. | Mail Ack | Synchronous | Remote | Local | No | External | Notification | Event acknowledging request to send a message |
15. | Register_Device_Profile (delivery_context) | Synchronous | Local | Remote | No | External | Notification | Occurs on connection |
16. | Update_Device_Profile (delivery_context) | Asynchronous/ Synchronous | Local | Remote | No | External/ User | Notification | The user selects a new set of modalities by pressing a button or making menu selections (synchronous event). If the device can detect changes in the network or location via GPS or beacons, then the event is asynchronous. |
17. | On_Focus (field_name) | Synchronous | Local | Remote | No | User | Action | Event sends the selected field to the multimodal synchronization server for the purpose of loading the appropriate input constraints for the field. |
18. | Handwriting_Reco () | Synchronous | Local | Local | Yes | User | Action | Event to invoke the handwriting recognizer (HWR) after pen input in a field. In the current scenario, we consider that HWR is handled locally, but this may be expanded later to include remote processing. |
19. | Submit_Partial_Result () | Synchronous | Local | Remote | No | External | Notification | Result of recognition of field input is sent to the server |
20. | Send_Ink (ink_data, time_stamp) | Synchronous | Local | Remote | Yes | User | Action | Ink collected for a pen gesture is sent to the multimodal server for integration. As before, this event associates time stamp information with the ink data for synchronization. The result of the pen gesture can be transmitted as a sequence of (x,y) coordinates relative to the device display. |
21 | Collect_Pen_Input () | Synchronous | Local | Local | Yes | User | Action | Ink collection could be interpreted first locally into basic shapes (i.e. circles, lines) and have those transmitted to the server. |
22 | Send_Gesture (gesture_data, time_stamp) | Synchronous | Local | Remote | Yes | User | Action | The server can provide a deeper semantic interpretation than the basic shapes that are recognized on the client. |
audio-visual speech
Combination of video and audio to process input (joint face/lips/movement recognition and speech recognition) and generate output (audio-visual media).
complementary use of modalities
A use of modalities where the interactions available to the user differ per modality.
composite input
Composite input is input received on multiple modalities at the same time and treated as a single, integrated compound input by downstream processes.
conflicting input
Contradictory inputs provided by the user in different modalities or on different devices. For example, they may indicate different exclusive selections.
context
A session context consists of the history of the interaction between the user and the multimodal system, including the input received from the user, the output presented to the user, the current data model and the sequence of data model changes.
coordination capability
Capability of a multimodal system to combine multimodal inputs into composite inputs based on an interpretation algorithm that decides what makes sense to combine based on the context.
CC/PP [Composite Capability/Preference Profiles]
A W3C working group which is developing an RDF-based framework for the management of device profile information. For more details about the group activity please visit http://www.w3.org/Mobile/CCPP/
concatenation
The text-to-speech engine concatenates short digital-audio segments and performs intersegment smoothing to produce a continuous sound.
CSS
Cascading Stylesheets
data file
Argument files to input or output processing algorithms.
default synchronization behavior
Synchronization behavior supported by default by a multimodal application.
delivery context
A set of attributes that characterizes the capabilities of the access mechanism in terms of device profile, user profile (e.g. identity, preferences and usage patterns) and situation. Delivery context may have static and dynamic components.
device
A piece of hardware used to access and interact with an application.
device profile
A particular subset of the delivery context that describes the device characteristics, including for example device form factor, available modalities, and level of synchronization and coordination.
DI [Device Independence]
The W3C Device Independence Activity is working to ensure seamless Web access with all kinds of devices, and worldwide standards for the benefit of Web users and content providers alike. For more details please refer to http://www.w3.org/2001/di/
digital ink
Stored or recognized handwriting input.
directed dialog
A dialog in which one party (the user or the computer) follows a pre-selected path, independent of the responses of the other. (cf. mixed initiative dialog)
distributed components
System components may live at various points of the network, including the local client.
DOM [Document Object Model]
A standard interface to the contents of a web page. Please visit http://www.w3.org/DOM/ for more details.
EMMA
Extensible MultiModal Annotation Markup Language. Formerly known as NLSML (Natural Language Semantics Markup Language). This markup language is intended for use by systems to represent semantic interpretations for a variety of inputs, including but not necessarily limited to, speech and natural language text input.
event
An event is a representation of some asynchronous occurrence of interest to the multimodal system. Examples include mouse clicks, hanging up the phone, and speech recognition errors. Events may be associated with data, e.g. the location where the mouse was clicked.
event handler
A software object intended to interpret and respond to a given class of events.
event source
An agent (human or software) capable of generating events.
execution model (configuration)
Runtime configuration of the various system components in a particular manifestation of a multimodal system.
external events
External input events are events that do not originate from direct user input. External output events are events that originate in the multimodal system and are handled by other processes.
GPS [Global Positioning System]
A worldwide radio-navigation system formed from a constellation of 24 satellites and their ground stations. GPS uses these "man-made stars" as reference points to calculate positions accurate to a matter of meters.
grammar
A computational mechanism that defines a finite or infinite set of legal strings, usually with some structure.
handwriting
Use of the pen for input which is converted into text or symbols. Involves handwriting recognition.
history
Portions of the profile and session context persisted for the same user across sessions.
HTML [HyperText Markup Language]
A simple markup language used to create hypertext documents that are portable from one platform to another. To find more information about the specification of HTML and the working group activity please visit http://www.w3c.org/MarkUp/
HTTP [Hypertext Transfer Protocol]
To get details about the HTTP working group and the HTTP specification please visit http://www.w3c.org/Protocols/.
human language
Any spoken language (e.g. French, Japanese, English, etc.).
ink
See digital ink.
input event
Event, set of events or macro-event generated by a user interaction in a particular modality on a particular device.
input constraints
Specify how inputs can be combined via rules or interaction management strategies. For example, the markup language may coordinate grammars for modalities other than speech with speech grammars to avoid duplication of effort in authoring multimodal grammars.
input processing
Algorithm applied to a particular input in order to transform or extract information from it (e.g. filtering, speech recognition, speaker recognition, NL parsing, ...). The algorithm may rely on data files as arguments (e.g. grammar, acoustic model, NL models, ...).
interaction manager
An interaction manager generates or updates the presentation by processing user inputs, session context and possibly other external knowledge sources to determine the intent of the user. An interaction manager relies on strategies to determine focus and intent as well as to disambiguate, correct and confirm sub-dialogs. We typically distinguish directed dialogs (e.g. user-driven or application-driven) and mixed initiative or free flow dialogs.
lip synchronization (lipsynch)
Output media where at least a face has lip movements synchronized with output audio speech.
markup components
XML vocabularies that provide markup-level access to various system components.
media synchronization
Synchronization between output media as specified by SMIL: http://www.w3.org/AudioVideo/
medium
A description that can be rendered into physical effects that can be perceived and interacted with by the user in one or multiple modalities and on one or multiple devices.
MIDI
Musical Instrument Digital Interface, an audioformat.
A style of dialog where both parties (the computer and the user)can control what is talked about and when. A party may on its ownchange the course of the interaction (e.g., by asking questions,providing more or less information than what was requested ormaking digressions). Mixed initiative dialog is contrasted withdirected dialog where only one party controls the conversation. (cfdirected dialog)
MMI [Multimodal Interaction]
A W3C Working Group which is developing markup specifications that extend the Web user interface to allow multiple modes of interaction. For more details of the MMI working group and MMI activity, please visit http://www.w3c.org/2002/mmi/
The type of communication channel used for interaction. It also covers the way an idea is expressed or perceived, or the manner in which an action is performed.
Change of modality to perform a particular interaction. It can be decided by the user or imposed by the application or runtime (e.g. when a phone call drops).
MPEG
Working group established under the joint direction of the International Organization for Standardization and the International Electrotechnical Commission (ISO/IEC), whose goal is to create standards for digital video and audio compression. More precisely, MPEG defines the syntax of audio and video formats requiring low data rates, as well as the operations to be undertaken by decoders.
MP3 [MPEG Audio Layer-3]
An Internet music format. For MP3 related technologies please refer to http://www.mp3-tech.org/
A multimodal system supports communication with the user through different modalities such as voice, gesture, and typing. (cf. modality)
must specify
A "must specify" requirement must be satisfied by the multimodal specification(s), starting from their very first version.
natural language (NL)
Term used for human language, as opposed to artificial languages (such as computer programming languages or those based on mathematical logic). A processor capable of handling NL must typically be able to deal with a flexible set of sentences.
natural language generation (NLG)
A technique for generating natural language sentences based on some higher-level information. Generation by template is an example of a simple language generation technique. "The flight from <departure-city> to <arrival-city> leaves at <departure-time>" is an example of a template where the slots indicated by <…> have to be filled with the appropriate information by a higher-level process.
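As an illustration only (mirroring the template above; the slot names and function are assumptions, not mandated by any specification), template-based generation can be as simple as string substitution performed by a higher-level process:

    # Hypothetical template-based natural language generation.
    TEMPLATE = ("The flight from {departure_city} to {arrival_city} "
                "leaves at {departure_time}.")

    def generate(slots: dict) -> str:
        """Fill the template slots with values supplied by a higher-level process."""
        return TEMPLATE.format(**slots)

    print(generate({"departure_city": "Paris",
                    "arrival_city": "Tokyo",
                    "departure_time": "10:45"}))
    # -> The flight from Paris to Tokyo leaves at 10:45.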
natural language processing
Natural language understanding, generation, translation and other transformations on human language.
natural language understanding (NLU)
The process of interpreting natural language phrases to specify their meaning, typically as a formula in formal logic.
nice to specify
A "nice to specify" requirement will be taken into account when designing the specification. If a technical solution is available, the specifications will try to satisfy the requirement or support the feature, provided that it does not excessively delay the work plan.
The act of communicating an event (see subscribe).
override mechanism for synchronization
Information that specifies how the synchronization should behave when not following its default behavior. (cf. default synchronization)
output generation
Expressing information to be conveyed in a user-friendly form, possibly using multiple output media streams.
Algorithm to apply in order to transform or generate an output (e.g. TTS, NLG).
semantics
The meaning or interpretation of a word, phrase, or sentence, as opposed to its syntactic form. In natural language and dialog technology the term semantics is typically used to indicate a representation of a phrase or a sentence whose elements can be related to entities of the application (e.g. departure airport and arrival time for a flight application), or dialog acts (e.g. request for help, repeat, etc.).
The process of interpreting the semantic part of a grammar. The result of the interpretation is a semantic representation. This process is often referred to as Semantic Tagging.
The semantic result of parsing a written sentence or a spoken utterance. The semantic interpretation can be expressed as attribute-value pairs or more complex structures. W3C is working on the definition of a Semantic Representation formalism.
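Purely as an illustration (the attribute names below are hypothetical and not taken from any W3C formalism), the semantic representation of a spoken request such as "I want to fly from Paris to Tokyo tomorrow" could be expressed as attribute-value pairs:

    # Hypothetical semantic representation as attribute-value pairs.
    semantic_representation = {
        "intent": "book_flight",       # dialog act / application entity
        "departure_city": "Paris",
        "arrival_city": "Tokyo",
        "departure_date": "tomorrow",  # still to be resolved to an absolute date
    }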
sequential input
A sequential input is one received on a single modality. The modality may change over time. (cf. simultaneous or composite input)
sequential multimodal application
A sequential multimodal application is one in which the user may interact with the application in only one modality at a time, switching between modalities as needed.
session
The time interval during which an application and its context is associated with a user and persisted. Within a session, users may suspend and resume interaction with an application within the same modality or device, or switch modality or device.
session level synchronizationgranularity
Granularity at which a multimodal application supports suspend and resume behavior across modalities.
should specify
The specifications (multimodal markup language and other) will aim at addressing and satisfying the requirement or supporting the features during the lifetime of the working group. Early specifications will take this into account to allow easy and interoperable updates.
simultaneous input
Simultaneous inputs denote inputs that can come from different modalities but are not combined into composite inputs. Simultaneous multimodal inputs imply that the inputs from several modalities are interpreted one after the other, in the order in which they were received, instead of being combined before interpretation.
External information that can affect the usage or expected behavior of multimodal applications, including for example on-going activities (e.g. walking versus driving), environment (e.g. noisy), privacy (e.g. alone versus in public), etc.
SMIL [Synchronized Multimedia Integration Language]
A W3C Recommendation, SMIL 2.0 enables simple authoring of interactive audiovisual applications. See http://www.w3.org/TR/smil20/ for details.
speech recognition
The ability of a computer to understand the spoken word for the purpose of receiving command and data input from the speaker.
speech recognition engine
A software/hardware component that performs recognition from a digital-audio stream. Speech recognition engines are supplied by vendors who specialize in the software.
subscribe
The act of informing an event source that you want to be notified of some class of events.
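For illustration only (the class and method names below are assumptions, not drawn from any W3C specification), subscription and notification can be sketched as a simple publish/subscribe pattern:

    # Hypothetical event source: handlers subscribe to a class of events
    # and are notified whenever an event of that class occurs.
    class EventSource:
        def __init__(self):
            self._handlers = {}  # event class -> list of subscribed handlers

        def subscribe(self, event_class, handler):
            self._handlers.setdefault(event_class, []).append(handler)

        def notify(self, event_class, data=None):
            for handler in self._handlers.get(event_class, []):
                handler(data)

    source = EventSource()
    source.subscribe("mouse_click", lambda data: print("clicked at", data))
    source.notify("mouse_click", (120, 45))  # -> clicked at (120, 45)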
supplementary use of modalities
Describes multimodal applications in which every interaction (input or output) can be carried through in each modality as if it was the only available modality.
Suspend and resume behavior; an application suspended in one modality can be resumed in the same or another modality.
The way that an input in one modality is reflected in the output in another modality/device, as well as the way that it may be combined across modalities (coordination capability).
synchronization granularity or level
The text-to-speech engine synthesizes the glottal pulse from human vocal cords and applies various filters to simulate throat length, mouth cavity, lip shape, and tongue position.
Technologies for converting textual (ASCII) information into synthetic speech output. Used in voice-processing applications requiring production of broad, unrelated, and unpredictable vocabularies, such as products in a catalog or names and addresses. This technology is appropriate when system design constraints prevent the more efficient use of speech concatenation alone.
Annotation of an event that characterizes the relative (with respect to an agreed-upon reference) or absolute time of occurrence of the event.
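As an illustrative sketch only (the field names are hypothetical), an input event might carry both an absolute time of occurrence and a time relative to an agreed-upon reference such as the start of the session:

    import time

    # Hypothetical time-stamped event record.
    session_start = time.time()  # agreed-upon reference point

    def make_event(event_type, data=None):
        now = time.time()
        return {
            "type": event_type,
            "data": data,
            "absolute_time": now,                  # seconds since the Unix epoch
            "relative_time": now - session_start,  # seconds since session start
        }

    event = make_event("mouse_click", (120, 45))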
TTS
text-to-speech
turn
Set of inputs collected from the user before updating the output.
URI
Uniform Resource Identifier - http://www.w3.org/Addressing/
A particular subset of the delivery context that describes the user, including for example the identity, personal information, personal preferences and usage preferences.
XML Events
An XML Events module that provides XML languages with the ability to uniformly integrate event listeners and associated event handlers with DOM Level 2 event interfaces. The result is to provide an interoperable way of associating behaviors with document-level markup. For the XML Events specification, please visit http://www.w3.org/TR/2001/WD-xml-events-20011026/Overview.html#s_intro
XSL
Extensible Stylesheet Language
XSLT
Extensible Stylesheet Language Transformations