Multimodal Interaction Use Cases

W3C NOTE 4 December 2002

This version: http://www.w3.org/TR/2002/NOTE-mmi-use-cases-20021204/
Latest version: http://www.w3.org/TR/mmi-use-cases/
Previous version: this is the first publication
Editors: Emily Candell, Dave Raggett

Abstract

The W3C Multimodal Interaction Activity is developing specifications as a basis for a new breed of Web applications in which you can interact using multiple modes of interaction, for instance, using speech, handwriting, and key presses for input, and spoken prompts, audio and visual displays for output. This document describes several use cases for multimodal interaction and presents them in terms of varying device capabilities and the events needed by each use case to couple different components of a multimodal application.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the W3C.

W3C's Multimodal Interaction Activity is developing specifications for extending the Web to support multiple modes of interaction. This document describes several use cases as the basis for gaining a better understanding of the requirements for multimodal interaction, and the kinds of information flows needed for multimodal applications.

This document has been produced as part of the W3C Multimodal Interaction Activity, following the procedures set out for the W3C Process. The authors of this document are members of the Multimodal Interaction Working Group (W3C Members only). This is a Royalty Free Working Group, as described in W3C's Current Patent Practice NOTE. Working Group participants are required to provide patent disclosures.

Please send comments about this document to the public mailing list: www-multimodal@w3.org (public archives). To subscribe, send an email to <www-multimodal-request@w3.org> with the word subscribe in the subject line (include the word unsubscribe if you want to unsubscribe).

A list of current W3C Recommendations and other technical documents including Working Drafts and Notes can be found at http://www.w3.org/TR/.

1. Introduction

Analysis of use cases provides insight into the requirements for applications likely to require a multimodal infrastructure.

The use cases described below were selected for analysis in order to highlight different requirements resulting from application variations in areas such as device requirements, event handling, network dependencies and methods of user interaction.

It should be noted that although the results of this analysis will be used as input to the Multimodal Specification being developed by the W3C Multimodal Interaction Working Group, there is no guarantee that all of these applications will be implementable using the language defined in the specification.

1.1 Use Case Device Classification

Thin Client

A device with little processing power and capabilities that can be used to capture user input (microphone, touch display, stylus, etc.) as well as non-user input such as GPS. The device may have a very limited capability to interpret the input, for example a small vocabulary speech recognition, or a character recognizer. The bulk of the processing occurs on the server including natural language processing and dialog management.

An example of such a device may be a mobile phone with DSR capabilities and a visual browser (there could actually be thinner clients than this).

Thick Client

A device with powerful processing capabilities, such that most of the processing can occur locally. Such a device is capable of input capture and interpretation. For example, the device can have a medium vocabulary speech recognizer, a handwriting recognizer, natural language processing and dialog management capabilities. The data itself may still be stored on the server.

An example of such a device may be a recent production PDA or an in-car system.

Medium Client

A device capable of input capture and some degree of interpretation. The processing is distributed in a client/server or a multidevice architecture. For example, a medium client will have the voice recognition capabilities to handle small vocabulary command and control tasks but connects to a voice server for more advanced dialog tasks.

1.2 Use Case Summaries

Table 1: Form Filling for air travel reservation

Description	Device Classification	Device Details	Execution Model
The means for a user to reserve a flight using a wireless personal mobile device and a combination of input and output modalities. The dialogue between the user and the application is directed through the use of a form-filling paradigm.	Thin and medium clients	touch-enabled display (i.e., supports pen input), voice input, local ASR and Distributed Speech Recognition Framework, local handwriting recognition, voice output, TTS, GPS, wireless connectivity, roaming between various networks.	Client Side Execution

Scenario Details

User wants to make a flight reservation with his mobile device while he is on the way to work. The user initiates the service by making a phone call to a multimodal service (telephone metaphor) or by selecting an application (portal environment metaphor). The details are not described here.

As the user moves between networks with very different characteristics, the user is offered the flexibility to interact using the preferred and most appropriate modes for the situation. For example, while sitting in a train, the use of stylus and handwriting can achieve higher accuracy than speech (due to surrounding noise) and protect privacy. When the user is walking, the more appropriate input and output modalities would be voice with some visual output. Finally, at the office the user can use pen and voice in a synergistic way.

The dialogue between the user and the application is driven by a form-filling paradigm where the user provides input to fields such as "Travel Origin:", "Travel Destination:", "Leaving on date", "Returning on date". As the user selects each field in the application to enter information, the corresponding input constraints are activated to drive the recognition and interpretation of the user input. The capability of providing composite multimodal input is also examined, where input from multiple modalities is combined for the interpretation of the user's intent.
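Composite multimodal input can be illustrated with a small sketch. The Python below is only one possible reading, under stated assumptions: the class names, the `integrate` function, the one-second tolerance and the "this one" phrase check are all hypothetical and are not defined by this document. It combines a spoken phrase and a pen gesture into a single interpretation when the two inputs are close together in time.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed tolerance: inputs further apart than this are treated as unrelated.
MAX_GAP_SECONDS = 1.0


@dataclass
class SpeechInput:
    text: str     # recognized utterance
    start: float  # timestamps in seconds
    end: float


@dataclass
class InkGesture:
    target: str   # item the gesture selected (hypothetical identifier)
    start: float
    end: float


def integrate(speech: SpeechInput, ink: InkGesture) -> Optional[dict]:
    """Combine a deictic utterance with a pen gesture, or return None."""
    # A negative gap means the two time intervals overlap.
    gap = max(speech.start, ink.start) - min(speech.end, ink.end)
    if gap > MAX_GAP_SECONDS:
        return None
    # Only utterances that refer to "this one" need the gesture to resolve them.
    if "this one" not in speech.text.lower():
        return None
    return {"action": "select", "target": ink.target}
```

A real integrator would work on recognizer hypotheses and confidence scores rather than a single string; the sketch only shows why both inputs need timestamps.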

Table 2: Driving Directions

Description	Device Classification	Device Details	Execution Model
This application provides a mechanism for a user to request and receive driving directions via speech and graphical input and output	Medium Client	on-board system (in a car) with a graphical display, map database, touch screen, voice and touch input, speech output, local ASR and TTS Processing and GPS.	Client Side Execution

Scenario Details

User wants to go to a specific address from his current location and while driving wants to take a detour to a local restaurant (the user does not know the restaurant address nor the name). The user initiates service via a button on his steering wheel and interacts with the system via the touch screen and speech.

Table 3: Name Dialing

Description	Device Classification	Device Details	Execution Model
The means for users to call someone by saying their name.	thin and fat devices	Telephone	The study covers several possibilities: (a) whether the application runs in the device or the server; (b) whether the device supports limited local speech recognition. These choices determine the kinds of events that are needed to coordinate the device and network based services.

Scenario Details

Janet presses a button on her multimodal phone and says one of the following commands:

Call Wendy
Call Wendy on her cell phone
Call Wendy at work
Call Wendy Smith at Acme Research

The application initially looks for a match in Janet's personal contact list and if no match is found then proceeds to look in other directories. Directed dialog and tapered help are used to narrow down the search, using aural and visual prompts. Janet is able to respond by pressing buttons, or tapping with a stylus, or by using her voice.

Once a selection has been made, rules defined by Wendy are used to determine how the call should be handled. Janet may see a picture of Wendy along with a personalized message (aural and visual) that Wendy has left for her. Call handling may depend on the time of day, the location and status of both parties, and the relationship between them. An "ex" might be told to never call again, while Janet might be told that Wendy will be free in half an hour after Wendy's meeting has finished. The call may be automatically directed to Wendy's home, office or mobile phone, or Janet may be invited to leave a message.

2. Use Case Details

2.1 Use-case: Form filling for air travel reservation

Description: The air travel reservation use case describes a scenario in which the user books a flight using a wireless personal mobile device and a combination of input and output modalities.

The device has a touch-enabled display (i.e., supports pen input) and it is voice enabled. The use case describes a rich multimodal interaction model that allows the user to start a session while commuting on the train, continue the interaction while walking to his office and complete the transaction while seated at his office desk. As the user moves between environments with very different characteristics, the user is given the opportunity to interact using the preferred and most appropriate modes for the situation. For example, while sitting in a train, the use of stylus and handwriting can offer higher accuracy than speech (due to noise) and protect privacy. When the user is walking, the more appropriate input and output modalities would be voice with some visual output. Finally, at the office the user can use pen and voice in a synergistic way.

This example assumes the seamless transition through a variety of connectivity options such as high bandwidth LAN at the office (e.g., 802.11), lower bandwidth while walking (e.g., a cellular network such as GPRS) and low bandwidth with intermittent connectivity while on the train (e.g., the device can get disconnected when going through a tunnel). The scenario also takes advantage of network services such as location and time.

Actors

User who makes the air travel reservation
Mobile device with touch-enabled display, wireless network connectivity, handwriting recognition capability and limited voice recognition capability on the device.
Network service with full voice dialog capabilities, connection to travel reservation database and location/time services.

Additional Assumptions

Data capabilities are available on the communications provider's network. Voice requirements are satisfied either via voice capabilities on the communications provider network or through a DSR framework that utilizes the existing data capabilities.
There are means for describing user and device profile information and means of exchanging this information between server and client.

Table 4: Event Table

User Action	Action on device	Events sent from device	Action on server	Events sent From server
Device turned on	Registers with network and uploads delivery context [available I/O modalities, bandwidth, user-specific info (e.g., home city)]	register_device (delivery_context)	Complete session initiation by registering device and delivery context (init_session)	register_ack
User picks travel app (taps with stylus or says travel)	Client side of application is started	app_connect (app_name)	Loads a page that is appropriate to current profile	app_connect_ack (start_page)
Application is running and ready to take input. Origin city was guessed from user profile or location service. User is on the train. Active I/O modalities are pen, display and audio output.
User picks a field in the form to interact with the stylus	Destination field gets highlighted	on_focus (field_name)	Server loads the appropriate constraints for input on this field. Constraints are sent to device for handwriting recognition (HWR).	listen_ack (field_grammar)
User starts writing. When he is finished	Handwriting recognition performed locally with visual and audio presentation of result (i.e., earcon)
If recognition confidence is low, a different earcon is played and a pop-up menu of top-n hypotheses is displayed.
User approves result by moving to next field with stylus (e.g., departure time)	Result is submitted to server. Time field is highlighted.	submit_partial (destination) on_focus (field_name)	Dialog state is updated. Appropriate constraints for input on this field are loaded. Grammar constraints are sent to the device	listen_ack (field_grammar)
User gets off the train and starts walking - I/O modality is voice only
User explicitly switches profile via button press, or through non-user sensory input the profile is changed	Profile update - only voice enabled input with voice and visual output	update (delivery_context)	Speech recognition and output module initialization. Synchronization of dialog state between modalities. Audio prompt "what time do you want to leave" is generated.	send (audio_prompt)
In response to audio prompt, user says "I want a flight in the morning".	Audio is collected and sent to server through data or voice channel	send (audio)	Recognizes voice and generates list of hypotheses. Corresponding audio prompt is created (e.g., "would you like to fly at 10 or 11 in the morning").	send (audio_prompt)
While walking, field selection is either driven by the dialog engine on the server, or by the user uttering simple phrases (e.g., voice graffiti)
User reaches his office.	User explicitly switches profile via button press, or through non-user sensory input the profile is changed.	Events and handlers as previously for changing the delivery context to accommodate interaction via voice, pen and GUI selection
At this point in the dialogue, it has been determined that there are no direct flights between origin and destination. The application displays available routes with in-between stops on a map and the user is prompted to select one.
User says "I would like to take this one" while making a pen gesture (i.e., circling over the preferred route)	Ink and audio are collected and sent to the server with timestamp information.	send (audio) send (ink)	Server receives the two inputs and integrates them into a semantic representation. Server updates app with selection, acknowledging that input integration was possible.	completeAck
At this point in the dialog, payment authorization needs to be made. User enters credit card information via voice, pen or keypad.
User provides signature for authorization purposes	Ink is collected with information about pressure and tilt.	send (ink)	Server verifies signature.	DONE
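The device-to-server exchange in the event table can be sketched as a small dispatcher. This is a minimal illustration, not a defined protocol: the event names are taken from the table, while the `ReservationServer` class, its `handle` method, the payload shapes and the grammar contents are assumptions made for the example.

```python
from typing import Any, Optional, Tuple

# Hypothetical per-field input constraints; real grammars would be far richer.
FIELD_GRAMMARS = {
    "destination": ["Boston", "Chicago", "Denver"],
    "departure_time": ["morning", "afternoon", "evening"],
}


class ReservationServer:
    """Toy server side for the form-filling use case (illustrative only)."""

    def __init__(self) -> None:
        self.delivery_context: dict = {}
        self.form: dict = {}

    def handle(self, event: str, payload: Any = None) -> Optional[Tuple[str, Any]]:
        """Map an event sent from the device to the event sent back, if any."""
        if event == "register_device":
            self.delivery_context = dict(payload)
            return ("register_ack", None)
        if event == "app_connect":
            return ("app_connect_ack", "start_page")
        if event == "on_focus":
            # Load the constraints for the focused field and send them back.
            return ("listen_ack", FIELD_GRAMMARS.get(payload, []))
        if event == "submit_partial":
            field, value = payload
            self.form[field] = value  # dialog state is updated
            return None
        if event == "update":
            # Delivery context changed, e.g. the user switched to voice.
            self.delivery_context.update(payload)
            return ("send", "audio_prompt")
        raise ValueError(f"unknown event: {event}")
```

For example, `handle("on_focus", "destination")` returns a `listen_ack` carrying the destination grammar, mirroring the row in which the user picks a field with the stylus.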

2.2 Use-case: Driving Directions

Assumptions

ASR services are local for simple requests (e.g. session preference setup)
ASR is server-based for complex requests (e.g. addresses)
TTS is local
Execution model is hosted on the device.
single language - with acknowledgement that we will ultimately need language selection
availability (always on) - with acknowledgement that there may be temporary interruptions due to unexpected circumstances (e.g. tunnels, mountains)
driver is alone [cannot get assistance]
Additional applications may be available when the service is initiated via a service selection menu (this is beyond the scope of this use case analysis)
Initiating recognition requires a single button press. A button press indicating end of speech is optional, assuming a preconfigured timeout to stop listening (requiring the user to hold down a button while driving may be dangerous)
At any time during the session, the user may change display options via the touch screen (includes zooming in and changing route display options). Display options may also be changed using speech by initiating a dialog by pressing the button on the steering wheel

Actors

Primary Device:

on-board system (in a car) with the following capabilities:
- graphical display:
  - maps
  - Estimated time of arrival
  - Textual Directions
- touch screen
- voice (input and output)
- keyboard/text input
- local ASR and TTS processing
- access to remote servers (ASR and App Server)
- GPS

Data sources:

route database
traffic conditions
GPS data
speedometer
landmarks database and places of interest:
- nearest gas station
- nearest restaurant of a specific type
User Preference Database

Scenario Walkthrough (User point of view)

User preferences (These may be changed on a per session basis):

Primary Input: Speech
Secondary Input: Touch Screen
Speech and Graphical Output
Preferences are stored on the server to enable multiple users to use the same device (preferences may be retrieved automatically based on speaker identification or key identification, eliminating the need for an authentication dialog)

User wants to go to a specific address from his current location and while driving wants to take a detour to a local restaurant (the user does not know the restaurant address nor the name)

Table 5: Event Table

User Action/External Input	Action on Device	Event Description	Event Handler	Resulting Action
User presses button on steering wheel	Service is initiated and GPS satellite detection begins	HTTP Request to app server	App server returns initial page to device	Welcome prompts are played. Authentication dialog is initiated (may be initiated via speaker identification or key identification).
User interacts in an authentication dialog	Device executes authentication dialog using local ASR processing	HTTP Request to app server which includes user credentials	App server returns initial page to device including user preferences	User is prompted for a destination (if additional services are available after authentication, assume that user selects driving direction application)
Initial GPS Input	N/A	GPS_Data_In Event	Device handles location information	Device updates map on graphical display (assumes all maps are stored locally on device)
User selects option to change volume of on-board unit using touch display.	N/A	Touch_screen_event (includes x, y coordinates)	Touch screen detects and processes input	Volume indicator changes on screen. Volume of speech output is changed
User presses button on steering wheel	Device initiates connection to ASR server	Start_Listening Event	ASR Server receives request and establishes connection	"listening" icon appears on display (utterances prior to establishing the connection are buffered)
User says destination address (may improve recognition accuracy by sending grammar constraints to server based on a local dialog with the user instead of allowing any address from the start)	N/A	N/A	ASR Server processes speech and returns results to device	Device processes results and plays confirmation dialog to user while highlighting destination and route on graphical display
User confirms destination	Device performs ASR Processing locally. Upon confirmation, destination info is sent to app server	HTTP Request is sent to app server (includes current location and destination information)	App Server processes input and returns data to device	Device processes results and updates graphical display with route and directions highlighting next step
GPS Input at regular intervals	N/A	GPS_Data_In Event	Device processes location data and checks if location milestone is hit	Device updates map on graphical display (assumes all maps are stored locally on device) and highlights current step. When milestone is hit, next instruction is played to user
GPS Input at regular intervals (indicating driver is off course)	N/A	GPS_Data_In Event	Device processes location data and determines that user is off course	Map on graphical display is updated and textual message is displayed indicating that route is not correct. Prompt is played from the device indicating that route is being recalculated
N/A	Route request is sent to app server including new location data	HTTP Request is sent to app server (includes current location and destination information)	App Server processes input and returns data to device	Device processes results and updates graphical display with route and directions highlighting next step
Alert received on device based on traffic conditions	N/A	Route_Change Alert	Device processes event and initiates dialog to determine if route should be recalculated	User is informed of traffic conditions and asked whether route should be recalculated.
User requests recalculation of route based on current traffic conditions	Device performs ASR Processing locally. Upon confirmation, destination info is sent to app server	HTTP Request is sent to app server (includes current location and destination information)	App Server processes input and returns data to device	Device processes results and updates graphical display with route and directions highlighting next step
GPS Input at regular intervals	N/A	GPS_Data_In Event	Device processes location data and checks if location milestone is hit	Device updates map on graphical display (assumes all maps are stored locally on device) and highlights current step. When milestone is hit, next instruction is played to user
User presses button on steering wheel	Connection to ASR server is established	Start_Listening Event	ASR Server receives request and establishes connection	User hears acknowledgement prompt for continuation, and "listening" icon appears on display
User requests new destination by destination type while still depressing button on steering wheel (may improve recognition accuracy by sending grammar constraints to server based on a local dialog with the user)	N/A	N/A	ASR Server processes speech and returns results to device	Device processes results and plays confirmation dialog to user while highlighting destination and route on graphical display
User confirms destination via a multiple interaction dialog to determine exact destination	Device executes dialog based on user responses (using local ASR Processing) and accesses app server as needed	HTTP requests to app server for dialog and data specific to user response	App server responds with appropriate dialog	User interacts in a dialog and selects destination. User is asked whether this is a new destination
User indicates that this is a stop on the way to original destination	Device sends updated destination information to app server	HTTP Request for updated directions (based on current location, detour destination, and ultimate destination)	App Server processes input and returns data to device	Device processes results and updates graphical display with new route and directions highlighting next step
GPS Input at regular intervals	N/A	GPS_Data_In Event	Device processes location data and checks if location milestone is hit	Device updates map on graphical display (assumes all maps are stored locally on device) and highlights current step. When milestone is hit, next instruction is played to user
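The handling of a GPS_Data_In event (milestone hit, still on route, or off course) might look roughly like the sketch below. It rests on several assumptions that are not part of this document: positions are planar (x, y) coordinates in metres, the route is a list of such points, distance is measured to the nearest route point rather than the nearest road segment, and the function name and both thresholds are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # planar coordinates in metres (an assumption)


def classify_position(
    position: Point,
    route: List[Point],
    next_index: int,
    milestone_radius: float = 30.0,
    off_course_distance: float = 100.0,
) -> str:
    """Return 'milestone', 'off_course' or 'on_route' for one GPS fix."""

    def distance(a: Point, b: Point) -> float:
        return math.hypot(a[0] - b[0], a[1] - b[1])

    # Milestone: the vehicle is close enough to the next instruction point.
    if distance(position, route[next_index]) <= milestone_radius:
        return "milestone"
    # Off course: the vehicle is far from every point on the planned route.
    if min(distance(position, p) for p in route) > off_course_distance:
        return "off_course"
    return "on_route"
```

On "milestone" the device would play the next instruction; on "off_course" it would display the message, play the recalculation prompt and send a new route request to the app server.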

Protocols:

HTTP
Proprietary protocol for connection to ASR server?
GPS
Others

Events:

ASR Events
Touch Screen Events
GPS Updates
Refresh Triggers
Traffic Alerts
Others???

Synchronization Issues:

Spoken Directions must be synchronized with current location
When route changes while prompts are playing, current prompts must be stopped and new prompts queued. This may be triggered by the following:
- BSW (button on steering wheel) pressed by user
- Screen is touched
- Traffic Update event is received
- Driver Error
Screen must be updated to reflect current location and route. This may be triggered by:
- Refresh Event
- Change of destination
- Change of route
- Driver Error
Asynchronous events such as traffic updates need to be synchronized with explicit user requests including:
- Route change requests
- Display/Output Preference change requests
Others???
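The requirement that current prompts be stopped and new prompts queued when the route changes can be sketched with a simple queue. This is an illustrative sketch only; the `PromptQueue` class and its method names are hypothetical, and actual audio playback is left out.

```python
from collections import deque
from typing import Deque, Iterable, Optional


class PromptQueue:
    """Tracks the prompt being played and the prompts waiting to be played."""

    def __init__(self) -> None:
        self.current: Optional[str] = None
        self.pending: Deque[str] = deque()

    def enqueue(self, prompt: str) -> None:
        self.pending.append(prompt)

    def play_next(self) -> Optional[str]:
        """Start the next pending prompt, or go idle if none is left."""
        self.current = self.pending.popleft() if self.pending else None
        return self.current

    def on_route_change(self, new_prompts: Iterable[str]) -> Optional[str]:
        """Stop the current prompt, drop stale ones and queue the new ones.

        Returns the prompt that was interrupted, if any.
        """
        interrupted = self.current
        self.current = None
        self.pending.clear()
        self.pending.extend(new_prompts)
        return interrupted
```

Any of the triggers listed above (button press, screen touch, traffic update, driver error) would call `on_route_change` with the prompts for the recalculated route.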

Latency Concerns

Unanticipated app server delays may cause directions to be inaccurate

Scenario Considerations

Input Information:

Starting address/location:
- explicit street address
- current location obtained via GPS
- landmark or place of interest
Ending address/location:
- explicit street address
- landmark or place of interest
Traffic Conditions
General preferences:
- highway vs. scenic route
- time vs. distance
- style of output (graphical, turn-by-turn, etc...)
- units of output (miles vs. kilometers)

Possible Devices:

Phone with display
Phone without display (voice only)
In-dash system (GPS, ASR, TTS)
PC
PDA
Phone (voice + data)
UMTS

Available Technologies:

Communication (2.5G, 3G)
Display (Y/N)
Application run-time environment (BREW, J2ME, etc)
Server access

Data sources:

route database
traffic conditions
location [GPS]
speed and time of arrival [GPS, speedometer]
landmarks database and places of interest:
- nearest gas station
- nearest restaurant of a specific type
User Preference Database

Output Mechanisms:

graphical (map)
text description
voice
fax
dynamic updates (recalculation based on traffic information, driver error, etc...)
single delivery of results vs. multiple/sequential delivery of results as needed

2.3 Use-case: Multimodal Name Dialling

Overview

The Name Dialing use case describes a scenario in which users can say a name into their mobile terminals and be connected to the named person based on the called party's availability for that caller.

If the called user is not available, the calling user may be given the choice of either leaving a message on the called user's voicemail system or sending an email to the called user. The called user may provide a personalized message for the caller, including, for example, "Don't ever call me again!"

The called user is given the opportunity of selecting which device the call should be routed to, e.g. work, mobile, home, or voice mail. This may be dependent on the time of day, the called user's location, and the identity of the calling user.

The use case assumes a rich model of name dialling as an example of a premium service exploiting a range of information such as personal and network directories, location, presence, buddy lists and personalization features.

The benefits of making this a multimodal interaction include the ability to view and listen to information about the called user, and to be able to use a keypad or stylus, as an alternative to using voice as part of the name selection process.

Actors

Caller: user who wishes to place a call
Called user: user who wishes control over how incoming calls are handled
Mobile display phone with a lightweight client browser, and optional speaker-dependent minimal speech recognition capabilities
Network based directory service with speech recognition capabilities; this provides support for looking up names in personal contact lists, as well as in corporate and public directories
Network based unified messaging service with provision for composing, transferring and playing back messages, including personalized messages intended for specific callers
User profile database with presence information, buddy lists, and personalized call handling rules

Assumptions

The user has a device with a button that is pushed to place a call. The device has recording capabilities. [voice activation is power hungry and unreliable in noisy environments]

Both voice and data capabilities are available on the communications provider's network (not necessarily as simultaneously active modes).

If the phone supports speech recognition and there is a local copy of the personal phone contact list, then the user's spoken input is first recognized against the local directory for a possible match and if unsuccessful, the request is extended back to the directory provider.

The directory provider has access to a messaging service and to user profiles and presence information. The directory provider thus knows the whereabouts of each registered user - on the phone, at work, unavailable etc.

The directory provider enforces access control rules to ensure individual and corporate privacy. This isn't explored in this use case.

People can be identified by personal names like "Wendy" or by nicknames or aliases. The personal contact list provides a means for subscribers to define their own aliases, and to limit the scope of search (there are a lot of Wendys worldwide).

There is a user agent on the client device with an XHTML browser and optional speaker-dependent speech recognition capabilities.

There is a client server relationship between the user agent on the device and the directory provider.

The dialog could be driven from either the client device or from the network. This doesn't affect the user view, but does alter the events used to coordinate the two systems. This will be explored in a later section.

The Name Dialing use case will be described through the following views:

User view

User pushes a button and says

  "Call Wendy Smith"

It is also possible to say such things as:

  "Call Wendy"
  "Call Wendy Smith at work"
  "Call Wendy at home"
  "Call Wendy Smith on her mobile phone"

Multiple scenarios are possible here:

If local recognition is supported, the utterance will be first processed by a local name dialling application. If there is no match, the recorded utterance is forwarded to a network based name dialling application.

The user's personal contact list will take priority over the corporate and public directories. This is independent of whether the personal list is held locally in the device or in the network.
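The directory priority described above can be sketched as an ordered search. This is a simplified illustration under assumptions: the `lookup` function is hypothetical, directories are plain lists of names, and matching is a case-insensitive substring test rather than speech recognition.

```python
from typing import List, Optional, Tuple


def lookup(
    name: str,
    personal: List[str],
    corporate: List[str],
    public: List[str],
) -> Tuple[Optional[str], List[str]]:
    """Return (source, matches) from the first directory that has a match.

    The personal contact list is searched first, then the corporate
    directory, then the public directory.
    """
    wanted = name.lower()
    ordered = (("personal", personal), ("corporate", corporate), ("public", public))
    for source, directory in ordered:
        matches = [entry for entry in directory if wanted in entry.lower()]
        if matches:
            return source, matches
    return None, []
```

Because the search stops at the first directory with a match, a "Wendy" in the personal list hides every other Wendy in the corporate and public directories, which is the intended narrowing of scope.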

The following situations can arise when the user says a name:

Single match: the caller is presented with information about the callee. This may include a picture taken from the callee's profile. The caller is asked for a confirmation before the call is put through.
Multiple matches: if the number of matches is small (perhaps five or less), the caller is asked to choose from the list. This is presented to the caller via speech and accompanied with a display of a list of names and pictures. The caller can then:
- Use a button on the phone to select a list item.
- Point or touch a link on the screen in the presented list.
- Say index number or expanded name from the presented list.
A further alternative is to say "that one" as the system speaks each item in the list in sequence. This method is offered in case the user needs hands and eyes free operation, or the device is incapable of displaying the list.
Lots of matches: for example, when the caller says a common name. The caller is led through a directed dialog to narrow down the search.
No recognition: the recognizer wasn't able to find a match. The user could have failed to say anything, or there could have been too much noise. A tapered help mechanism is invoked. Callers could be asked to repeat themselves, or asked to key in the number or speak it digit by digit.
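The four situations above amount to a branch on the number of matches. A minimal sketch follows; the function name and the returned labels are hypothetical, and the limit of five is taken from the "perhaps five or less" wording, so it should be read as a tunable assumption.

```python
from typing import List, Optional

SMALL_LIST_LIMIT = 5  # "perhaps five or less"; an adjustable assumption


def next_dialog_step(matches: Optional[List[str]]) -> str:
    """Choose the next dialog step from the recognition result.

    `matches` is None or empty when the recognizer found nothing.
    """
    if not matches:
        return "tapered_help"      # ask to repeat, key in or spell the number
    if len(matches) == 1:
        return "confirm"           # show callee details and ask to confirm
    if len(matches) <= SMALL_LIST_LIMIT:
        return "choose_from_list"  # speak and display the candidates
    return "directed_dialog"       # narrow down a long list step by step
```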

Assuming that the user successfully makes a selection:

The system retrieves further information on the called user suchas the current location and local time of that user. Theinformation presented may depend on the relationship between thecalled and calling users. This assumes support for a buddy list andpresence capability. The called user may specify her availabilityfor specific individuals or groups of would be callers depending ontime of day etc.
Two scenarios are described here:
1. The system finds that the called person is currently available. A picture and/or sound bite is provided to the caller. The system places the call and the user is connected to Wendy Smith.
  Post condition: The user is in a call with the intended party.
2. The system finds that the called person is unavailable. The system attempts to connect to the called user's voicemail system.
  Assuming this succeeds, the system plays the following prompt back to the caller: "Wendy Smith is currently unavailable. She has left this message for you."
  The message is played out. It could be a multimedia message with recorded sound, text, pictures and even short video clips.
  The system plays a prompt back: "Would you like to leave a message?"
  The user says "Yes".
  The user is then connected to the voicemail system and leaves a message for Wendy Smith.
  If Wendy's voicemail box is full or unavailable, the system offers the caller the chance of composing an email. This occupies the caller's storage allocation until it has been sent.
  Post condition: The user has left a message for the intended party.

The availability of the called user may depend on the time of day, whether the called user is away from her work or home location, and who the calling user is. For example, when travelling you may want to take calls on your mobile during the day. Don't you hate it when people call you in the middle of the night because they don't realize what time zone you are in! You may want to make an exception for close friends and family members. There may also be some people from whom you never want to accept calls, not even voice messages!
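
One way to picture such availability rules is a small policy object consulted before the call is placed. This is a minimal sketch, not part of the use case itself: the class `AvailabilityPolicy`, its field names, the default daytime window and the three outcome strings are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import time


@dataclass
class AvailabilityPolicy:
    # Hypothetical daytime window, in the called user's local time.
    day_start: time = time(8, 0)
    day_end: time = time(22, 0)
    # Callers who may ring at any hour (close friends, family).
    always_allowed: set = field(default_factory=set)
    # Callers who are never accepted, not even for voice messages.
    blocked: set = field(default_factory=set)

    def decide(self, caller_id, callee_local_time):
        """Return 'reject', 'ring' or 'voicemail' for an incoming call."""
        if caller_id in self.blocked:
            return "reject"
        if caller_id in self.always_allowed:
            return "ring"
        if self.day_start <= callee_local_time < self.day_end:
            return "ring"
        return "voicemail"
```

With `AvailabilityPolicy(always_allowed={"mum"})`, a colleague calling at 03:00 local time is diverted to voicemail, while the same call from "mum" rings through.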

When a user is notified of an incoming call, the device may present information on the caller including a photograph, name, sound bite, location and local time information, depending on the relationship between the caller and callee. The user then has an opportunity to accept the call or to divert it to voice mail.

Directory provider View

The client on the user device records the spoken input. The spoken input is recognized against the directory on the client device. When this fails, the utterance is sent to the directory provider for recognition.
If the user device doesn't support local recognition, it may still need to record the utterance, so that the user can start talking immediately without needing to wait for the connection to the directory provider to be completed.
The directory provider retrieves the profile for the calling user. This has information on which device the user is calling from, the current location of the calling user etc. The calling user is authenticated and authorized.
The recognizer in the provider recognizes the spoken utterance and returns the result. This result can either be a single entry or a list of possible close matches.
The server application (hosting the directory provider) controls the flow of the interaction from this point on.
The server goes to the database and retrieves more information based on the recognizer result.
The provider queries the presence of the called user, and personalization information (buddy list, location and presence information, etc.) to construct the content for the response.
A result may be returned to the client device in more than one way:
A single XHTML page is constructed with both a picture and audio giving the complete name of the recognized match.
The feedback can include two channels, such as a visual channel for the picture and a separate voice channel for playing back the name of the user (an optimization for reduced latency).
The server creates and transfers a composed page to the client device.
Once the client receives the content from the application server, multiple scenarios are possible based on the recognizer result. See the user view for details.
Picking a choice from a list can be done by voice, button or stylus. The user should be able to browse the list, and to revisit the list upon rejecting a confirmation of a preceding choice.
Example: user says "Call the first one". This utterance is processed by the directory provider to select the first match.
The directory application may need to apply a directed dialog to narrow the search when there are more than a few matches, or when recognition fails and tapered help needs to be offered.

What is driving the dialog?

The details of the events depend on whether the dialog is being driven from the network or from the user device.

When the device sends a spoken utterance to the server, the user may have spoken a name such as "Tom Smith" or spoken a command such as "the last one". If the directory search is being driven by the user device, the server's response is likely to be a short list of matches, or a command or error code. To support the application, the server would provide a suite of functions, including the means for the device to set the recognition context, the ability to play specific prompts, and to download information on named users.

If the network is driving the dialog, the device sends the spoken utterance in the same way, but the responses are actions to update the display and local state. If the caller presses a button or uses a stylus to make a selection, this event will be sent to the server. The device and server could exchange low level events, such as a stylus tap at a given coordinate, or higher level events such as which name the user has selected from the list.
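
The difference between low level and higher level events can be illustrated with a short sketch. Everything here is hypothetical: the event classes `StylusTap` and `NameSelected`, the fixed `ROW_HEIGHT`, and the assumption that the list is drawn as equal-height rows starting at the top of the screen. The point is only that a low level event forces the receiver to know the screen layout, while a higher level event does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StylusTap:
    """Low level event: raw screen coordinates of a tap."""
    x: int
    y: int


@dataclass(frozen=True)
class NameSelected:
    """Higher level event: index of the chosen list entry."""
    index: int


ROW_HEIGHT = 20  # hypothetical pixel height of one list row


def resolve_selection(event, displayed_names):
    """Return the selected name, or None if the event hits no entry."""
    if isinstance(event, NameSelected):
        index = event.index
    elif isinstance(event, StylusTap):
        # The receiver must know the layout to interpret a raw tap.
        index = event.y // ROW_HEIGHT
    else:
        raise TypeError(f"unsupported event: {event!r}")
    if 0 <= index < len(displayed_names):
        return displayed_names[index]
    return None
```

Sending `NameSelected` keeps layout knowledge on the device, at the cost of making the device interpret the tap itself; sending `StylusTap` keeps the device simple but couples the server to the display geometry.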

Table 6: Event Table

User action	Action on device	Events sent from device	Action on server	Events sent from server
Turns on the device	Registers with the Directory Provider through the operator in the network and downloads the personal directory	register user(userID)	Directory Provider gets register information, updates user's presence and location info, loads user's personal info (buddy list, personal directory, ...)	acknowledgement + personal directory (in practice, SyncML would be used to reduce network traffic)
Pushes a button to place a call	Local reco initialized, activates the personal directory
	Displays a prompt "Please say a name"
Speaks a name	Local recognition against personal directory
a) If grammar matches:
	Display the name or namelist (see following table)
Confirms by pressing the call button again if 1 name is displayed, or selects a name on the list (see following table)	Fetches the number from the personal directory	call(userID, number)	Checks the location and presence status of the called party	call ok(picture) OR called party not available
	If call ok, displays the picture and places a call; if called party not available, displays/plays a corresponding prompt about leaving a message or sending an e-mail
i) if user chooses to leave a message:
User agrees to leave a message by pressing a suitable button	Initializes the recording, displays a prompt to start the recording
User speaks and ends by pressing a suitable button	Closes the recording, sends the recording to the Directory Provider app	leave message(userID, number, recording)	Stores the message for the called party	message ok
ii) if user chooses to send an e-mail:
User selects 'send e-mail' option by pressing a suitable button	Starts an e-mail writing application
Writes e-mail	Fetches the e-mail address from the personal directory, sends e-mail, closes the e-mail app	send mail(userID, mail address, text)	Sends the e-mail to the called party	mail ok
b) if personal grammar does not match:
	sends the utterance to be recognized in the network	send(userID, utterance)	Recognition against public directory	reco ok(namelist) OR reco nok
	If reco ok, displays the name or namelist (more details in following table), activates local reco with the index list if more than one name; if reco nok, displays/plays a message to the user
Confirms by pressing the call button again if 1 name is displayed, or selects a name on the list (see following table)	Selection received (perhaps spoken index recognized first)	call(userID, number)	Checks the location ... [continues as described above]
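
The server side of two of the events in Table 6, `call` and `leave message`, can be sketched as follows. This is an illustrative toy rather than a proposed design: the class name `DirectoryProviderSketch`, the in-memory dictionaries, and the idea of passing in a ready-made map of available parties (instead of real presence and location checks) are all assumptions.

```python
class DirectoryProviderSketch:
    """Toy server answering the 'call' and 'leave message' events."""

    def __init__(self, available_parties):
        # available_parties: number -> picture reference, for parties who
        # are currently reachable (a stand-in for real presence checks).
        self.available_parties = dict(available_parties)
        self.mailboxes = {}

    def call(self, user_id, number):
        """Handle call(userID, number) and return the server's event."""
        if number in self.available_parties:
            return ("call ok", self.available_parties[number])
        return ("called party not available", None)

    def leave_message(self, user_id, number, recording):
        """Handle leave message(userID, number, recording)."""
        self.mailboxes.setdefault(number, []).append((user_id, recording))
        return ("message ok", None)
```

A device receiving `("called party not available", None)` would then display or play the prompt about leaving a message or sending an e-mail, as in the table.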

Table 7: Interaction details of displaying and confirming the recognition results

User action	Action on device	Events sent from device	Action on server	Events sent from server
... speaker utterance has been processed by the recogniser
A. Very high confidence, unique match, auto confirmation (NB! I would recommend letting the user confirm this explicitly; this would also make the application behaviour seem more consistent to the user since some kind of confirmation would be needed every time)
	Displays the name and shows/plays clear prompt "Calling ..."
	Fetches the number	call(userID, number)	Checks the location and presence status of the called party	call ok(picture) OR called party not available
B. High confidence, unique match, explicit confirmation
	Displays the name and picture, prompt asking "Place a call?"
Confirms by pressing the call button again	Fetches the number	call(userID, number)	Checks the location and presence status of the called party	call ok(picture) OR called party not available
C. High confidence with several matching entries, or medium confidence with either unique match or several matching entries
	Displays the namelist with indexes, activates index grammar on local reco; if multiple entries with same spelling, additional info should be added on the list
Selects a name by speaking the index or navigating to the correct name with keypad and pressing the call button	Fetches the number	call(userID, number)	Checks the location and presence status of the called party	call ok(picture) OR called party not available
D. Low confidence, no match from the directory/ies
	Prompts "Not found, please try again"
User speaks the name again	New recognition; on the 2nd or 3rd 'nomatch', change the prompt to something like "Sorry, no number found"
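
Cases A to D of Table 7 amount to a decision on recognition confidence and the number of matching entries. The sketch below is one possible reading, under stated assumptions: the numeric thresholds are invented for illustration (the table only speaks of "very high", "high", "medium" and "low"), the function names are hypothetical, and the attempt at which the no-match prompt changes is a parameter because the table leaves it open ("2nd or 3rd").

```python
# Hypothetical confidence bands; the table gives no numeric values.
VERY_HIGH = 0.95
HIGH = 0.80
MEDIUM = 0.50


def confirmation_strategy(confidence, n_matches):
    """Pick the Table 7 case for a recognition result."""
    if n_matches == 0 or confidence < MEDIUM:
        return "reprompt"            # case D
    if n_matches == 1 and confidence >= VERY_HIGH:
        return "auto_confirm"        # case A
    if n_matches == 1 and confidence >= HIGH:
        return "explicit_confirm"    # case B
    return "show_indexed_list"       # case C


def no_match_prompt(attempt, give_up_after=2):
    """Return the prompt for the given 1-based no-match attempt."""
    if attempt < give_up_after:
        return "Not found, please try again"
    return "Sorry, no number found"
```

Following the note attached to case A, an application could simply map `auto_confirm` to the same handling as `explicit_confirm` so that the user always confirms.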

Table 8: No local recognition, all recognition in the Network

User action	Action on device	Events sent from device	Action on server	Events sent from server
Turns on the device	Registers with the Directory Provider through the operator in the network	register user(userID)	Directory Provider gets register information, updates user's presence and location info, loads user's personal info (buddy list, personal directory, ...)	register ack
Pushes a button to place a call		init reco(userID)	Activates the personal directory and public directory	reco init ok
	Displays a prompt "Please say a name"
Speaks a name	Sends the utterance to be recognized in the network	send(userID, utterance)	Recognition against personal directory first, if no match there with confidence greater than some threshold, then against public directory	reco ok(namelist) OR reco nok
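
The last row of Table 8 describes a two-tier search: personal directory first, then the public directory if nothing scores above a threshold. A minimal sketch of that control flow is given below. The function names, the default threshold of 0.7 and the word-overlap `toy_recognize` scorer are all assumptions; a real system would run a speech recognizer over audio rather than compare text.

```python
def toy_recognize(utterance, directory):
    """Stand-in for a speech recognizer: scores names by shared words."""
    spoken = set(utterance.lower().split())
    if not spoken:
        return []
    results = []
    for name in directory:
        words = set(name.lower().split())
        score = len(spoken & words) / len(spoken | words)
        if score > 0:
            results.append((name, score))
    # Best-scoring candidates first.
    return sorted(results, key=lambda item: -item[1])


def tiered_recognition(utterance, personal, public,
                       recognize=toy_recognize, threshold=0.7):
    """Try the personal directory first, then fall back to the public one."""
    hits = [(name, score) for name, score in recognize(utterance, personal)
            if score > threshold]
    if not hits:
        hits = recognize(utterance, public)
    if hits:
        return ("reco ok", [name for name, _ in hits])
    return ("reco nok", [])
```

Whether the threshold should also be applied to the public directory results is not stated in the table; this sketch returns every public candidate and leaves filtering to the confirmation step.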

3. Acknowledgements

The following people contributed to this document:

Paulo Baggia, Loquendo
Art Barstow, Nokia
Emily Candell, Comverse
Debbie Dahl, Consultant and Working Group Chair
Stephen Potter, Microsoft
Vlad Sejnoha, Scansoft
Luc Van Tichelin, Scansoft
Tasos Anastasakos, Motorola
Lin Chen, Voice Genie
Jim Larson, Intel Architecture Lab
T.V. Raman, IBM
Derek Schwenke, Mitsubishi Electric
Giovanni Seni, Motorola
Dave Raggett, W3C/Openwave
Bennett Marks, Nokia
Katriina Halonen, Nokia
Ramalingam Hariharan, Nokia
Stephane Maes, IBM
Purush Yeluripati
Kuansan Wang, Microsoft
