Please refer to the errata for this document, which may include some normative corrections.
See also translations.
Copyright © 2004 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
This document specifies VoiceXML, the Voice Extensible Markup Language. VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document has been reviewed by W3C Members and other interested parties, and it has been endorsed by the Director as a W3C Recommendation. W3C's role in making the Recommendation is to draw attention to the specification and to promote its widespread deployment. This enhances the functionality and interoperability of the Web.
This specification is part of the W3C Speech Interface Framework and has been developed within the W3C Voice Browser Activity.
The design of VoiceXML 2.0 has been widely reviewed (see the disposition of comments) and satisfies the Working Group's technical requirements. A list of implementations is included in the VoiceXML 2.0 implementation report, along with the associated test suite.
Comments are welcome on www-voice@w3.org (archive). See W3C mailing list and archive usage guidelines.
The W3C maintains a list of any patent disclosures related to this work.
In this document, the key words "must", "must not", "required", "shall", "shall not", "should", "should not", "recommended", "may", and "optional" are to be interpreted as described in [RFC2119] and indicate requirement levels for compliant VoiceXML implementations.
This document defines VoiceXML, the Voice Extensible Markup Language. Its background, basic concepts and use are presented in Section 1. The dialog constructs of form, menu and link, and the mechanism (Form Interpretation Algorithm) by which they are interpreted are then introduced in Section 2. User input using DTMF and speech grammars is covered in Section 3, while Section 4 covers system output using speech synthesis and recorded audio. Mechanisms for manipulating dialog control flow, including variables, events, and executable elements, are explained in Section 5. Environment features such as parameters and properties as well as resource handling are specified in Section 6. The appendices provide additional information including the VoiceXML Schema, a detailed specification of the Form Interpretation Algorithm and timing, audio file formats, and statements relating to conformance, internationalization, accessibility and privacy.
The origins of VoiceXML began in 1995 as an XML-based dialog design language intended to simplify the speech recognition application development process within an AT&amp;T project called Phone Markup Language (PML). As AT&amp;T reorganized, teams at AT&amp;T, Lucent and Motorola continued working on their own PML-like languages.
In 1998, W3C hosted a conference on voice browsers. By this time, AT&amp;T and Lucent had different variants of their original PML, while Motorola had developed VoxML, and IBM was developing its own SpeechML. Many other attendees at the conference were also developing similar languages for dialog design, such as HP's TalkML and PipeBeach's VoiceHTML.
The VoiceXML Forum was then formed by AT&amp;T, IBM, Lucent, and Motorola to pool their efforts. The mission of the VoiceXML Forum was to define a standard dialog design language that developers could use to build conversational applications. They chose XML as the basis for this effort because it was clear to them that this was the direction technology was going.
In 2000, the VoiceXML Forum released VoiceXML 1.0 to the public. Shortly thereafter, VoiceXML 1.0 was submitted to the W3C as the basis for the creation of a new international standard. VoiceXML 2.0 is the result of this work based on input from W3C Member companies, other W3C Working Groups, and the public.
Developers familiar with VoiceXML 1.0 are particularly directed to Changes from Previous Public Version, which summarizes how VoiceXML 2.0 differs from VoiceXML 1.0.
VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications.
Here are two short examples of VoiceXML. The first is the venerable "Hello World":
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
    http://www.w3.org/TR/voicexml20/vxml.xsd"
  version="2.0">
  <form>
    <block>Hello World!</block>
  </form>
</vxml>
The top-level element is <vxml>, which is mainly a container for dialogs. There are two types of dialogs: forms and menus. Forms present information and gather input; menus offer choices of what to do next. This example has a single form, which contains a block that synthesizes and presents "Hello World!" to the user. Since the form does not specify a successor dialog, the conversation ends.
Our second example asks the user for a choice of drink and then submits it to a server script:
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
    http://www.w3.org/TR/voicexml20/vxml.xsd"
  version="2.0">
  <form>
    <field name="drink">
      <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
      <grammar src="drink.grxml" type="application/srgs+xml"/>
    </field>
    <block>
      <submit next="http://www.drink.example.com/drink2.asp"/>
    </block>
  </form>
</vxml>
A field is an input field. The user must provide a value for the field before proceeding to the next element in the form. A sample interaction is:
C (computer): Would you like coffee, tea, milk, or nothing?
H (human): Orange juice.
C: I did not understand what you said. (a platform-specific default message.)
C: Would you like coffee, tea, milk, or nothing?
H: Tea
C: (continues in document drink2.asp)
This section contains a high-level architectural model, whose terminology is then used to describe the goals of VoiceXML, its scope, its design principles, and the requirements it places on the systems that support it.
The architectural model assumed by this document has the following components:
Figure 1: Architectural Model
A document server (e.g. a Web server) processes requests from a client application, the VoiceXML Interpreter, through the VoiceXML interpreter context. The server produces VoiceXML documents in reply, which are processed by the VoiceXML interpreter. The VoiceXML interpreter context may monitor user inputs in parallel with the VoiceXML interpreter. For example, one VoiceXML interpreter context may always listen for a special escape phrase that takes the user to a high-level personal assistant, and another may listen for escape phrases that alter user preferences like volume or text-to-speech characteristics.
The implementation platform is controlled by the VoiceXML interpreter context and by the VoiceXML interpreter. For instance, in an interactive voice response application, the VoiceXML interpreter context may be responsible for detecting an incoming call, acquiring the initial VoiceXML document, and answering the call, while the VoiceXML interpreter conducts the dialog after answer. The implementation platform generates events in response to user actions (e.g. spoken or character input received, disconnect) and system events (e.g. timer expiration). Some of these events are acted upon by the VoiceXML interpreter itself, as specified by the VoiceXML document, while others are acted upon by the VoiceXML interpreter context.
VoiceXML's main goal is to bring the full power of Web development and content delivery to voice response applications, and to free the authors of such applications from low-level programming and resource management. It enables integration of voice services with data services using the familiar client-server paradigm. A voice service is viewed as a sequence of interaction dialogs between a user and an implementation platform. The dialogs are provided by document servers, which may be external to the implementation platform. Document servers maintain overall service logic, perform database and legacy system operations, and produce dialogs. A VoiceXML document specifies each interaction dialog to be conducted by a VoiceXML interpreter. User input affects dialog interpretation and is collected into requests submitted to a document server. The document server replies with another VoiceXML document to continue the user's session with other dialogs.
VoiceXML is a markup language that:
Minimizes client/server interactions by specifying multiple interactions per document.
Shields application authors from low-level, platform-specific details.
Separates user interaction code (in VoiceXML) from service logic (e.g. CGI scripts).
Promotes service portability across implementation platforms. VoiceXML is a common language for content providers, tool providers, and platform providers.
Is easy to use for simple interactions, and yet provides language features to support complex dialogs.
While VoiceXML strives to accommodate the requirements of a majority of voice response services, services with stringent requirements may best be served by dedicated applications that employ a finer level of control.
The language describes the human-machine interaction provided by voice response systems, which includes:
Output of synthesized speech (text-to-speech).
Output of audio files.
Recognition of spoken input.
Recognition of DTMF input.
Recording of spoken input.
Control of dialog flow.
Telephony features such as call transfer and disconnect.
The language provides means for collecting character and/or spoken input, assigning the input results to document-defined request variables, and making decisions that affect the interpretation of documents written in the language. A document may be linked to other documents through Universal Resource Identifiers (URIs).
VoiceXML is an XML application [XML].
The language promotes portability of services through abstraction of platform resources.
The language accommodates platform diversity in supported audio file formats, speech grammar formats, and URI schemes. While producers of platforms may support various grammar formats, the language requires a common grammar format, namely the XML Form of the W3C Speech Recognition Grammar Specification [SRGS], to facilitate interoperability. Similarly, while various audio formats for playback and recording may be supported, the audio formats described in Appendix E must be supported.
The language supports ease of authoring for common types ofinteractions.
The language has well-defined semantics that preserve the author's intent regarding the behavior of interactions with the user. Client heuristics are not required to determine document element interpretation.
The language recognizes semantic interpretations from grammars and makes this information available to the application.
The language has a control flow mechanism.
The language enables a separation of service logic frominteraction behavior.
It is not intended for intensive computation, database operations, or legacy system operations. These are assumed to be handled by resources outside the document interpreter, e.g. a document server.
General service logic, state management, dialog generation, and dialog sequencing are assumed to reside outside the document interpreter.
The language provides ways to link documents using URIs, and also to submit data to server scripts using URIs.
VoiceXML provides ways to identify exactly which data to submit to the server, and which HTTP method (GET or POST) to use in the submittal.
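As a hypothetical sketch of this (the URI and the field names "city" and "state" are invented for illustration), a <submit> element names both the data and the HTTP method:

```xml
<!-- Hypothetical sketch: the URI and field names are placeholders.
     namelist selects which variables to submit; method selects GET or POST. -->
<submit next="http://www.example.com/servlet/weather"
        method="get"
        namelist="city state"/>
```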
The language does not require document authors to explicitly allocate and deallocate dialog resources, or deal with concurrency. Resource allocation and concurrent threads of control are to be handled by the implementation platform.
This section outlines the requirements on the hardware/software platforms that will support a VoiceXML interpreter.
Document acquisition. The interpreter context is expected to acquire documents for the VoiceXML interpreter to act on. The "http" URI scheme must be supported. In some cases, the document request is generated by the interpretation of a VoiceXML document, while other requests are generated by the interpreter context in response to events outside the scope of the language, for example an incoming phone call. When issuing document requests via http, the interpreter context identifies itself using the "User-Agent" header variable with the value "<name>/<version>", for example, "acme-browser/1.2".
Audio output. An implementation platform must support audio output using audio files and text-to-speech (TTS). The platform must be able to freely sequence TTS and audio output. If an audio output resource is not available, an error.noresource event must be thrown. Audio files are referred to by a URI. The language specifies a required set of audio file formats which must be supported (see Appendix E); additional audio file formats may also be supported.
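A common pattern for sequencing recorded audio with TTS fallback is sketched below ("welcome.wav" is an invented placeholder URI); the inline text of <audio> is synthesized if the audio file cannot be played:

```xml
<prompt>
  <!-- Sketch: "welcome.wav" is a placeholder. If the file is
       unavailable, the platform falls back to synthesizing
       the element's text content. -->
  <audio src="welcome.wav">Welcome to the application.</audio>
</prompt>
```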
Audio input. An implementation platform is required to detect and report character and/or spoken input simultaneously and to control input detection interval duration with a timer whose length is specified by a VoiceXML document. If an audio input resource is not available, an error.noresource event must be thrown.
It must report characters (for example, DTMF) entered by a user. Platforms must support the XML form of DTMF grammars described in the W3C Speech Recognition Grammar Specification [SRGS]. They should also support the Augmented BNF (ABNF) form of DTMF grammars described in the W3C Speech Recognition Grammar Specification [SRGS].
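A minimal sketch of an inline DTMF grammar in the XML form of [SRGS] (the rule name and digit set are invented for illustration):

```xml
<!-- Sketch: accepts a single DTMF key of 1, 2, or 3.
     The rule name "digit" is an invented example. -->
<grammar mode="dtmf" version="1.0" root="digit">
  <rule id="digit">
    <one-of>
      <item>1</item>
      <item>2</item>
      <item>3</item>
    </one-of>
  </rule>
</grammar>
```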
It must be able to receive speech recognition grammar data dynamically. It must be able to use speech grammar data in the XML Form of the W3C Speech Recognition Grammar Specification [SRGS]. It should be able to receive speech recognition grammar data in the ABNF form of the W3C Speech Recognition Grammar Specification [SRGS], and may support other formats such as the JSpeech Grammar Format [JSGF] or proprietary formats. Some VoiceXML elements contain speech grammar data; others refer to speech grammar data through a URI. The speech recognizer must be able to accommodate dynamic update of the spoken input for which it is listening through either method of speech grammar data specification.
It must be able to record audio received from the user. The implementation platform must be able to make the recording available to a request variable. The language specifies a required set of recorded audio file formats which must be supported (see Appendix E); additional formats may also be supported.
Transfer. The platform should be able to support making a third party connection through a communications network, such as the telephone.
A VoiceXML document (or a set of related documents called an application) forms a conversational finite state machine. The user is always in one conversational state, or dialog, at a time. Each dialog determines the next dialog to transition to. Transitions are specified using URIs, which define the next document and dialog to use. If a URI does not refer to a document, the current document is assumed. If it does not refer to a dialog, the first dialog in the document is assumed. Execution is terminated when a dialog does not specify a successor, or if it has an element that explicitly exits the conversation.
There are two kinds of dialogs: forms and menus. Forms define an interaction that collects values for a set of form item variables. Each field may specify a grammar that defines the allowable inputs for that field. If a form-level grammar is present, it can be used to fill several fields from one utterance. A menu presents the user with a choice of options and then transitions to another dialog based on that choice.
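Since the examples so far only show forms, here is a minimal sketch of a menu (the choice URIs are invented placeholders):

```xml
<!-- Sketch: a menu offering two destinations. The URIs are
     placeholders; <enumerate/> speaks the available choices. -->
<menu>
  <prompt>Say one of: <enumerate/></prompt>
  <choice next="http://www.example.com/sports.vxml">sports</choice>
  <choice next="http://www.example.com/weather.vxml">weather</choice>
</menu>
```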
A subdialog is like a function call, in that it provides a mechanism for invoking a new interaction, and returning to the original form. Variable instances, grammars, and state information are saved and are available upon returning to the calling document. Subdialogs can be used, for example, to create a confirmation sequence that may require a database query; to create a set of components that may be shared among documents in a single application; or to create a reusable library of dialogs shared among many applications.
A session begins when the user starts to interact with a VoiceXML interpreter context, continues as documents are loaded and processed, and ends when requested by the user, a document, or the interpreter context.
An application is a set of documents sharing the same application root document. Whenever the user interacts with a document in an application, its application root document is also loaded. The application root document remains loaded while the user is transitioning between other documents in the same application, and it is unloaded when the user transitions to a document that is not in the application. While it is loaded, the application root document's variables are available to the other documents as application variables, and its grammars remain active for the duration of the application, subject to the grammar activation rules discussed in Section 3.1.4.
Figure 2 shows the transition of documents (D) in an application that share a common application root document (root).
Figure 2: Transitioning between documents in an application.
Each dialog has one or more speech and/or DTMF grammars associated with it. In machine directed applications, each dialog's grammars are active only when the user is in that dialog. In mixed initiative applications, where the user and the machine alternate in determining what to do next, some of the dialogs are flagged to make their grammars active (i.e., listened for) even when the user is in another dialog in the same document, or on another loaded document in the same application. In this situation, if the user says something matching another dialog's active grammars, execution transitions to that other dialog, with the user's utterance treated as if it were said in that dialog. Mixed initiative adds flexibility and power to voice applications.
VoiceXML provides a form-filling mechanism for handling "normal" user input. In addition, VoiceXML defines a mechanism for handling events not covered by the form mechanism.
Events are thrown by the platform under a variety of circumstances, such as when the user does not respond, doesn't respond intelligibly, requests help, etc. The interpreter also throws events if it finds a semantic error in a VoiceXML document. Events are caught by catch elements or their syntactic shorthand. Each element in which an event can occur may specify catch elements. Furthermore, catch elements are also inherited from enclosing elements "as if by copy". In this way, common event handling behavior can be specified at any level, and it applies to all lower levels.
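As a sketch of this inheritance (the field name and grammar URI are invented), a handler declared at form level is inherited "as if by copy" by the field inside it:

```xml
<form>
  <!-- Sketch: this form-level handler also applies to the field
       below, since catch elements are inherited from enclosing
       elements. The grammar URI is a placeholder. -->
  <catch event="nomatch noinput">
    Sorry, I didn't get that.
  </catch>
  <field name="color">
    <prompt>Please pick a color.</prompt>
    <grammar src="color.grxml" type="application/srgs+xml"/>
  </field>
</form>
```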
A link supports mixed initiative. It specifies a grammar that is active whenever the user is in the scope of the link. If user input matches the link's grammar, control transfers to the link's destination URI. A link can be used to throw an event or go to a destination URI.
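A minimal sketch of a link ("help.vxml" and the spoken phrase are invented for illustration); the grammar remains active throughout the link's scope:

```xml
<!-- Sketch: saying "help" anywhere in the link's scope transfers
     control to the placeholder URI help.vxml. A link may specify
     an event attribute instead of next to throw an event. -->
<link next="help.vxml">
  <grammar type="application/srgs+xml" root="root" version="1.0">
    <rule id="root" scope="public">help</rule>
  </grammar>
</link>
```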
| Element | Purpose | Section |
|---|---|---|
| <assign> | Assign a variable a value | 5.3.2 |
| <audio> | Play an audio clip within a prompt | 4.1.3 |
| <block> | A container of (non-interactive) executable code | 2.3.2 |
| <catch> | Catch an event | 5.2.2 |
| <choice> | Define a menu item | 2.2.2 |
| <clear> | Clear one or more form item variables | 5.3.3 |
| <disconnect> | Disconnect a session | 5.3.11 |
| <else> | Used in <if> elements | 5.3.4 |
| <elseif> | Used in <if> elements | 5.3.4 |
| <enumerate> | Shorthand for enumerating the choices in a menu | 2.2.4 |
| <error> | Catch an error event | 5.2.3 |
| <exit> | Exit a session | 5.3.9 |
| <field> | Declares an input field in a form | 2.3.1 |
| <filled> | An action executed when fields are filled | 2.4 |
| <form> | A dialog for presenting information and collecting data | 2.1 |
| <goto> | Go to another dialog in the same or different document | 5.3.7 |
| <grammar> | Specify a speech recognition or DTMF grammar | 3.1 |
| <help> | Catch a help event | 5.2.3 |
| <if> | Simple conditional logic | 5.3.4 |
| <initial> | Declares initial logic upon entry into a (mixed initiative) form | 2.3.3 |
| <link> | Specify a transition common to all dialogs in the link's scope | 2.5 |
| <log> | Generate a debug message | 5.3.13 |
| <menu> | A dialog for choosing amongst alternative destinations | 2.2.1 |
| <meta> | Define a metadata item as a name/value pair | 6.2.1 |
| <metadata> | Define metadata information using a metadata schema | 6.2.2 |
| <noinput> | Catch a noinput event | 5.2.3 |
| <nomatch> | Catch a nomatch event | 5.2.3 |
| <object> | Interact with a custom extension | 2.3.5 |
| <option> | Specify an option in a <field> | 2.3.1.3 |
| <param> | Parameter in <object> or <subdialog> | 6.4 |
| <prompt> | Queue speech synthesis and audio output to the user | 4.1 |
| <property> | Control implementation platform settings | 6.3 |
| <record> | Record an audio sample | 2.3.6 |
| <reprompt> | Play a field prompt when a field is re-visited after an event | 5.3.6 |
| <return> | Return from a subdialog | 5.3.10 |
| <script> | Specify a block of ECMAScript client-side scripting logic | 5.3.12 |
| <subdialog> | Invoke another dialog as a subdialog of the current one | 2.3.4 |
| <submit> | Submit values to a document server | 5.3.8 |
| <throw> | Throw an event | 5.2.1 |
| <transfer> | Transfer the caller to another destination | 2.3.7 |
| <value> | Insert the value of an expression in a prompt | 4.1.4 |
| <var> | Declare a variable | 5.3.1 |
| <vxml> | Top-level element in each VoiceXML document | 1.5.1 |
A VoiceXML document is primarily composed of top-level elements called dialogs. There are two types of dialogs: forms and menus. A document may also have <meta> and <metadata> elements, <var> and <script> elements, <property> elements, <catch> elements, and <link> elements.
Document execution begins at the first dialog by default. As each dialog executes, it determines the next dialog. When a dialog doesn't specify a successor dialog, document execution stops.
Here is "Hello World!" expanded to illustrate some of this. It now has a document level variable called "hi" which holds the greeting. Its value is used as the prompt in the first form. Once the first form plays the greeting, it goes to the form named "say_goodbye", which prompts the user with "Goodbye!" Because the second form does not transition to another dialog, it causes the document to be exited.
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
    http://www.w3.org/TR/voicexml20/vxml.xsd"
  version="2.0">
  <meta name="author" content="John Doe"/>
  <meta name="maintainer" content="hello-support@hi.example.com"/>
  <var name="hi" expr="'Hello World!'"/>
  <form>
    <block>
      <value expr="hi"/>
      <goto next="#say_goodbye"/>
    </block>
  </form>
  <form id="say_goodbye">
    <block>
      Goodbye!
    </block>
  </form>
</vxml>
Alternatively the forms can be combined:
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
    http://www.w3.org/TR/voicexml20/vxml.xsd"
  version="2.0">
  <meta name="author" content="John Doe"/>
  <meta name="maintainer" content="hello-support@hi.example.com"/>
  <var name="hi" expr="'Hello World!'"/>
  <form>
    <block>
      <value expr="hi"/>
      Goodbye!
    </block>
  </form>
</vxml>
Attributes of <vxml> include:
| Attribute | Description |
|---|---|
| version | The version of VoiceXML of this document (required). The current version number is 2.0. |
| xmlns | The designated namespace for VoiceXML (required). The namespace for VoiceXML is defined to be http://www.w3.org/2001/vxml. |
| xml:base | The base URI for this document as defined in [XML-BASE]. As in [HTML], a URI which all relative references within the document take as their base. |
| xml:lang | The language identifier for this document. If omitted, the value is a platform-specific default. |
| application | The URI of this document's application root document, if any. |
Language information is inherited down the document hierarchy: the value of "xml:lang" is inherited by elements which also define the "xml:lang" attribute, such as <grammar> and <prompt>, unless these elements specify an alternative value.
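A sketch of this inheritance (the language codes and grammar content are invented for illustration): the grammar overrides the document default, while the prompt inherits it from <vxml>:

```xml
<!-- Sketch: the grammar specifies its own xml:lang (fr-FR); the
     prompt inherits en-US from <vxml>. Grammar body elided. -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form>
    <field name="answer">
      <grammar xml:lang="fr-FR" version="1.0" root="oui"> ... </grammar>
      <prompt>Please answer.</prompt>
    </field>
  </form>
</vxml>
```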
Normally, each document runs as an isolated application. In cases where you want multiple documents to work together as one application, you select one document to be the application root document, and the rest to be application leaf documents. Each leaf document names the root document in its <vxml> element.
When this is done, every time the interpreter is told to load and execute a leaf document in this application, it first loads the application root document if it is not already loaded. The application root document remains loaded until the interpreter is told to load a document that belongs to a different application. Thus one of the following two conditions always holds during interpretation:
The application root document is loaded and the user is executing in it: there is no leaf document.
The application root document and a single leaf document are both loaded and the user is executing in the leaf document.
If there is a chain of subdialogs defined in separate documents, then there may be more than one leaf document loaded, although execution will only be in one of these documents.
When a leaf document load causes a root document load, none of the dialogs in the root document are executed. Execution begins in the leaf document.
There are several benefits to multi-document applications.
Here is a two-document application illustrating this:
Application root document (app-root.vxml)
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
    http://www.w3.org/TR/voicexml20/vxml.xsd"
  version="2.0">
  <var name="bye" expr="'Ciao'"/>
  <link next="operator_xfer.vxml">
    <grammar type="application/srgs+xml" root="root" version="1.0">
      <rule id="root" scope="public">operator</rule>
    </grammar>
  </link>
</vxml>
Leaf document (leaf.vxml)
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
    http://www.w3.org/TR/voicexml20/vxml.xsd"
  version="2.0"
  application="app-root.vxml">
  <form id="say_goodbye">
    <field name="answer">
      <grammar type="application/srgs+xml" src="/grammars/boolean.grxml"/>
      <prompt>Shall we say <value expr="application.bye"/>?</prompt>
      <filled>
        <if cond="answer">
          <exit/>
        </if>
        <clear namelist="answer"/>
      </filled>
    </field>
  </form>
</vxml>
In this example, the application is designed so that leaf.vxml must be loaded first. Its application attribute specifies that app-root.vxml should be used as the application root document. So, app-root.vxml is then loaded, which creates the application variable bye and also defines a link that navigates to operator_xfer.vxml whenever the user says "operator". The user starts out in the say_goodbye form:
C: Shall we say Ciao?
H: Si.
C: I did not understand what you said. (a platform-specific default message.)
C: Shall we say Ciao?
H: Ciao
C: I did not understand what you said.
H: Operator.
C: (Goes to operator_xfer.vxml, which transfers the caller to a human operator.)
Note that when the user is in a multi-document application, at most two documents are loaded at any one time: the application root document and, unless the user is actually interacting with the application root document, an application leaf document. A root document's <vxml> element does not have an application attribute specified. A leaf document's <vxml> element does have an application attribute specified. An interpreter always has an application root document loaded; it does not always have an application leaf document loaded.
The name of the interpreter's current application is the application root document's absolute URI. The absolute URI includes a query string, if present, but it does not include a fragment identifier. The interpreter remains in the same application as long as the name remains the same. When the name changes, a new application is entered and its root context is initialized. The application's root context consists of the variables, grammars, catch elements, scripts, and properties in application scope.
During a user session an interpreter transitions from one document to another as requested by <choice>, <goto>, <link>, <subdialog>, and <submit> elements. Some transitions are within an application, others are between applications. The preservation or initialization of the root context depends on the type of transition:
If a document refers to a non-existent application root document, an error.badfetch event is thrown. If a document's application attribute refers to a document that also has an application attribute specified, an error.semantic event is thrown.
The following diagrams illustrate the effect of the transitions between root and leaf documents on the application root context. In these diagrams, boxes represent documents, box texture changes identify root context initialization, solid arrows symbolize transitions to the URI in the arrow's label, and dashed vertical arrows indicate an application attribute whose URI is the arrow's label.
Figure 3: Transitions that Preserve the Root Context
In this diagram, all the documents belong to the same application. The transitions are identified by the numbers 1-4 across the top of the figure. They are:
The next diagram illustrates transitions which initialize the root context.
Figure 4: Transitions that Initialize the Root Context
A subdialog is a mechanism for decomposing complex sequences of dialogs to better structure them, or to create reusable components. For example, the solicitation of account information may involve gathering several pieces of information, such as account number, and home telephone number. A customer care service might be structured with several independent applications that could share this basic building block, thus it would be reasonable to construct it as a subdialog. This is illustrated in the example below. The first document, app.vxml, seeks to adjust a customer's account, and in doing so must get the account information and then the adjustment level. The account information is obtained by using a subdialog element that invokes another VoiceXML document to solicit the user input. While the second document is being executed, the calling dialog is suspended, awaiting the return of information. The second document provides the results of its user interactions using a <return> element, and the resulting values are accessed through the variable defined by the name attribute on the <subdialog> element.
Customer Service Application (app.vxml)
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
    http://www.w3.org/TR/voicexml20/vxml.xsd"
  version="2.0">
  <form>
    <var name="account_number"/>
    <var name="home_phone"/>
    <subdialog name="accountinfo" src="acct_info.vxml#basic">
      <filled>
        <!-- Note the variable defined by "accountinfo" is
             returned as an ECMAScript object and it contains two
             properties defined by the variables specified in the
             "return" element of the subdialog. -->
        <assign name="account_number" expr="accountinfo.acctnum"/>
        <assign name="home_phone" expr="accountinfo.acctphone"/>
      </filled>
    </subdialog>
    <field name="adjustment_amount">
      <grammar type="application/srgs+xml" src="/grammars/currency.grxml"/>
      <prompt>
        What is the value of your account adjustment?
      </prompt>
      <filled>
        <submit next="/cgi-bin/updateaccount"/>
      </filled>
    </field>
  </form>
</vxml>
Document Containing Account Information Subdialog (acct_info.vxml)
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd"
  version="2.0">
 <form id="basic">
  <field name="acctnum">
   <grammar type="application/srgs+xml" src="/grammars/digits.grxml"/>
   <prompt>What is your account number?</prompt>
  </field>
  <field name="acctphone">
   <grammar type="application/srgs+xml"
     src="/grammars/phone_numbers.grxml"/>
   <prompt>What is your home telephone number?</prompt>
   <filled>
    <!-- The values obtained by the two fields are
       supplied to the calling dialog by the
       "return" element. -->
    <return namelist="acctnum acctphone"/>
   </filled>
  </field>
 </form>
</vxml>
Subdialogs add a new execution context when they are invoked. The subdialog could be a new dialog within the existing document, or a new dialog within a new document.

Subdialogs can be composed of several documents. Figure 5 shows the execution flow where a sequence of documents (D) transitions to a subdialog (SD) and then back.
Figure 5: Subdialog composed of several documents
The execution context in dialog D2 is suspended when it invokes the subdialog SD1 in document sd1.vxml. This subdialog specifies that execution is to be transferred to the dialog in sd2.vxml (using <goto>). Consequently, when the dialog in sd2.vxml returns, control is returned directly to dialog D2.

Figure 6 shows an example of a multi-document subdialog where control is transferred from one subdialog to another.
Figure 6: Subdialog composed of several documents
The subdialog in sd1.vxml specifies that control is to be transferred to a second subdialog, SD2, in sd2.vxml. While SD2 executes, there are two suspended contexts: the dialog context in D2 is suspended awaiting the return of SD1, and the dialog context in SD1 is suspended awaiting the return of SD2. When SD2 returns, control is returned to SD1. It in turn returns control to dialog D2.
Under certain circumstances (in particular, while the VoiceXML interpreter is processing a disconnect event) the interpreter may continue executing in the final processing state after there is no longer a connection to allow the interpreter to interact with the end user. The purpose of this state is to allow the VoiceXML application to perform any necessary final cleanup, such as submitting information to the application server. For example, the following <catch> element will catch the connection.disconnect.hangup event and execute in the final processing state:
<catch event="connection.disconnect.hangup">
  <submit namelist="myExit" next="http://mysite/exit.jsp"/>
</catch>
While in the final processing state the application must remain in the transitioning state and may not enter the waiting state (as described in Section 4.1.8). Thus for example the application should not enter <field>, <record>, or <transfer> while in the final processing state. The VoiceXML interpreter must exit if the VoiceXML application attempts to enter the waiting state while in the final processing state.

Aside from this restriction, execution of the VoiceXML application continues normally while in the final processing state. Thus for example the application may transition between documents while in the final processing state, and the interpreter must exit if no form item is eligible to be selected (as described in Section 2.1.1).
Forms are the key component of VoiceXML documents. A form contains:

A set of form items, elements that are visited in the main loop of the form interpretation algorithm. Form items are subdivided into input items that can be 'filled' by user input and control items that cannot.
Declarations of non-form item variables.
Event handlers.
"Filled" actions, blocks of procedural logic that execute when certain combinations of input item variables are assigned.
Form attributes are:
id | The name of the form. If specified, the form can be referenced within the document or from another document. For instance <form id="weather">, <goto next="#weather">. |
---|---|
scope | The default scope of the form's grammars. If it is dialog then the form grammars are active only in the form. If the scope is document, then the form grammars are active during any dialog in the same document. If the scope is document and the document is an application root document, then the form grammars are active during any dialog in any document of this application. Note that the scope of individual form grammars takes precedence over the default scope; for example, if a non-root document contains a form with the default scope "dialog" and a form grammar with the scope "document", then that grammar is active in any dialog in that document. |
This section describes some of the concepts behind forms, and then gives some detailed examples of their operation.

Forms are interpreted by an implicit form interpretation algorithm (FIA). The FIA has a main loop that repeatedly selects a form item and then visits it. The selected form item is the first in document order whose guard condition is not satisfied. For instance, a field's default guard condition tests to see if the field's form item variable has a value, so that if a simple form contains only fields, the user will be prompted for each field in turn.
Interpreting a form item generally involves:
Selecting and playing one or more prompts;
Collecting a user input, either a response that fills in one or more input items, or a throwing of some event (help, for instance); and

Interpreting any <filled> actions that pertain to the newly filled-in input items.

The FIA ends when it interprets a transfer of control statement (e.g. a <goto> to another dialog or document, or a <submit> of data to the document server). It also ends with an implied <exit> when no form item remains eligible to select.
The FIA is described in more detail in Section 2.1.6.
Form items are the elements that can be visited in the main loop of the form interpretation algorithm. Input items direct the FIA to gather a result for a specific element. When the FIA selects a control item, the control item may contain a block of procedural code to execute, or it may tell the FIA to set up the initial prompt-and-collect for a mixed initiative form.

An input item specifies an input item variable to gather from the user. Input items have prompts to tell the user what to say or key in, grammars that define the allowed inputs, and event handlers that process any resulting events. An input item may also have a <filled> element that defines an action to take just after the input item variable is filled. Input items consist of:
<field> | An input item whose value is obtained via ASR or DTMF grammars. |
---|---|
<record> | An input item whose value is an audio clip recorded by the user. A <record> element could collect a voice mail message, for instance. |
<transfer> | An input item which transfers the user to another telephone number. If the transfer returns control, the field variable will be set to the result status. |
<object> | This input item invokes a platform-specific "object" with various parameters. The result of the platform object is an ECMAScript Object. One platform object could be a builtin dialog that gathers credit card information. Another could gather a text message using some proprietary DTMF text entry method. There is no requirement for implementations to provide platform-specific objects, although implementations must handle the <object> element by throwing error.unsupported.objectname if the particular platform-specific object is not supported (note that 'objectname' in error.unsupported.objectname is a fixed string, so not substituted with the name of the unsupported object; more specific error information may be provided in the event "_message" special variable as described in Section 5.2.2). |
<subdialog> | A <subdialog> input item is roughly like a function call. It invokes another dialog on the current page, or invokes another VoiceXML document. It returns an ECMAScript Object as its result. |
There are two types of control items:
<block> | A sequence of procedural statements used for prompting and computation, but not for gathering input. A block has a (normally implicit) form item variable that is set to true just before it is interpreted. |
---|---|
<initial> | This element controls the initial interaction in a mixed initiative form. Its prompts should be written to encourage the user to say something matching a form level grammar. When at least one input item variable is filled as a result of recognition during an <initial> element, the form item variable of <initial> becomes true, thus removing it as an alternative for the FIA. |
Each form item has an associated form item variable, which by default is set to undefined when the form is entered. This form item variable will contain the result of interpreting the form item. An input item's form item variable is also called an input item variable, and it holds the value collected from the user. A form item variable can be given a name using the name attribute, or left nameless, in which case an internal name is generated.

Each form item also has a guard condition, which governs whether or not that form item can be selected by the form interpretation algorithm. The default guard condition just tests to see if the form item variable has a value. If it does, the form item will not be visited.
Typically, input items are given names, but control items are not. Generally form item variables are not given initial values and additional guard conditions are not specified. But sometimes there is a need for more detailed control. One form may have a form item variable initially set to hide a field, and later cleared (e.g., using <clear>) to force the field's collection. Another field may have a guard condition that activates it only when it has not been collected, and when two other fields have been filled. A block item could execute only when some condition holds true. Thus, fine control can be exercised over the order in which form items are selected and executed by the FIA; in general, however, many dialogs can be constructed without resorting to this level of complexity.
In summary, all form items have the following attributes:
name | The name of a dialog-scoped form item variable that will hold the value of the form item. |
---|---|
expr | The initial value of the form item variable; default is ECMAScript undefined. If initialized to a value, then the form item will not be executed unless the form item variable is cleared. |
cond | An expression to evaluate in conjunction with the test of the form item variable. If absent, this defaults to true, or in the case of <initial>, a test to see if any input item variable has been filled in. |
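As an illustration of the cond attribute, here is a sketch (the field names and grammar URIs are hypothetical, not taken from this specification) of a confirmation field guarded so that the FIA selects it only after two other fields have been filled:

```xml
<form>
 <field name="from_city">
  <grammar type="application/srgs+xml" src="/grammars/city.grxml"/>
  <prompt>What departure city?</prompt>
 </field>
 <field name="to_city">
  <grammar type="application/srgs+xml" src="/grammars/city.grxml"/>
  <prompt>What arrival city?</prompt>
 </field>
 <!-- The default guard (confirm undefined) still applies;
    the cond below is tested in conjunction with it. -->
 <field name="confirm"
    cond="from_city != undefined && to_city != undefined">
  <grammar type="application/srgs+xml" src="/grammars/boolean.grxml"/>
  <prompt>Fly from <value expr="from_city"/>
    to <value expr="to_city"/>?</prompt>
 </field>
</form>
```

While the cond expression is false, the confirm field is passed over even though its form item variable is undefined.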
The simplest and most common type of form is one in which the form items are executed exactly once in sequential order to implement a computer-directed interaction. Here is a weather information service that uses such a form.
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form>
  <block>Welcome to the weather information service.</block>

  <field name="state">
   <prompt>What state?</prompt>
   <grammar src="state.grxml" type="application/srgs+xml"/>
   <catch event="help">
     Please speak the state for which you want the weather.
   </catch>
  </field>

  <field name="city">
   <prompt>What city?</prompt>
   <grammar src="city.grxml" type="application/srgs+xml"/>
   <catch event="help">
     Please speak the city for which you want the weather.
   </catch>
  </field>

  <block>
   <submit next="/servlet/weather" namelist="city state"/>
  </block>
 </form>
</vxml>
This dialog proceeds sequentially:
C (computer): Welcome to the weather information service. What state?
H (human): Help
C: Please speak the state for which you want the weather.
H: Georgia
C: What city?
H: Tblisi
C: I did not understand what you said. What city?
H: Macon
C: The conditions in Macon Georgia are sunny and clear at 11 AM ...
The form interpretation algorithm's first iteration selects the first block, since its (hidden) form item variable is initially undefined. This block outputs the main prompt, and its form item variable is set to true. On the FIA's second iteration, the first block is skipped because its form item variable is now defined, and the state field is selected because the dialog variable state is undefined. This field prompts the user for the state, and then sets the variable state to the answer. A detailed description of the filling of form item variables from a field-level grammar may be found in Section 3.1.6. The third form iteration prompts and collects the city field. The fourth iteration executes the final block and transitions to a different URI.

Each field in this example has a prompt to play in order to elicit a response, a grammar that specifies what to listen for, and an event handler for the help event. The help event is thrown whenever the user asks for assistance. The help event handler catches these events and plays a more detailed prompt.

Here is a second directed form, one that prompts for credit card information:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form>
  <block>
    We now need your credit card type, number, and expiration date.
  </block>

  <field name="card_type">
   <prompt count="1">What kind of credit card do you have?</prompt>
   <prompt count="2">Type of card?</prompt>
   <!-- This is an inline grammar. -->
   <grammar type="application/srgs+xml" root="r2" version="1.0">
    <rule id="r2" scope="public">
     <one-of>
      <item>visa</item>
      <item>master <item repeat="0-1">card</item></item>
      <item>amex</item>
      <item>american express</item>
     </one-of>
    </rule>
   </grammar>
   <help>Please say Visa, MasterCard, or American Express.</help>
  </field>

  <field name="card_num">
   <grammar type="application/srgs+xml" src="/grammars/digits.grxml"/>
   <prompt count="1">What is your card number?</prompt>
   <prompt count="2">Card number?</prompt>
   <catch event="help">
    <if cond="card_type == 'amex' || card_type == 'american express'">
      Please say or key in your 15 digit card number.
    <else/>
      Please say or key in your 16 digit card number.
    </if>
   </catch>
   <filled>
    <if cond="(card_type == 'amex' || card_type == 'american express')
       && card_num.length != 15">
      American Express card numbers must have 15 digits.
      <clear namelist="card_num"/>
      <throw event="nomatch"/>
    <elseif cond="card_type != 'amex' && card_type != 'american express'
       && card_num.length != 16"/>
      MasterCard and Visa card numbers have 16 digits.
      <clear namelist="card_num"/>
      <throw event="nomatch"/>
    </if>
   </filled>
  </field>

  <field name="expiry_date">
   <grammar type="application/srgs+xml" src="/grammars/digits.grxml"/>
   <prompt count="1">What is your card's expiration date?</prompt>
   <prompt count="2">Expiration date?</prompt>
   <help>
     Say or key in the expiration date, for example one two oh one.
   </help>
   <filled>
    <!-- validate the mmyy -->
    <var name="mm"/>
    <var name="i" expr="expiry_date.length"/>
    <if cond="i == 3">
     <assign name="mm" expr="expiry_date.substring(0,1)"/>
    <elseif cond="i == 4"/>
     <assign name="mm" expr="expiry_date.substring(0,2)"/>
    </if>
    <if cond="mm == '' || mm < 1 || mm > 12">
     <clear namelist="expiry_date"/>
     <throw event="nomatch"/>
    </if>
   </filled>
  </field>

  <field name="confirm">
   <grammar type="application/srgs+xml" src="/grammars/boolean.grxml"/>
   <prompt>
     I have <value expr="card_type"/> number <value expr="card_num"/>,
     expiring on <value expr="expiry_date"/>. Is this correct?
   </prompt>
   <filled>
    <if cond="confirm">
     <submit next="place_order.asp"
       namelist="card_type card_num expiry_date"/>
    </if>
    <clear namelist="card_type card_num expiry_date confirm"/>
   </filled>
  </field>
 </form>
</vxml>
Note that the grammar alternatives 'amex' and 'american express' return literal values which need to be handled separately in the conditional expressions. Section 3.1.5 describes how semantic attachments in the grammar can be used to return a single representation of these inputs.
The dialog might go something like this:
C: We now need your credit card type, number, and expiration date.
C: What kind of credit card do you have?
H: Discover
C: I did not understand what you said. (a platform-specific default message.)

C: Type of card? (the second prompt is used now.)

H: Shoot. (fortunately treated as "help" by this platform)
C: Please say Visa, MasterCard, or American Express.
H: Uh, Amex. (this platform ignores "uh")
C: What is your card number?
H: One two three four ... wait ...
C: I did not understand what you said.
C: Card number?
H: (uses DTMF) 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 #
C: What is your card's expiration date?
H: one two oh one
C: I have Amex number 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5, expiring on 1 2 0 1. Is this correct?
H: Yes
Fields are the major building blocks of forms. A field declares a variable and specifies the prompts, grammars, DTMF sequences, help messages, and other event handlers that are used to obtain it. Each field declares a VoiceXML form item variable in the form's dialog scope. These may be submitted once the form is filled, or copied into other variables.
Each field has its own speech and/or DTMF grammars, specified explicitly using <grammar> elements, or implicitly using the type attribute. The type attribute is used for builtin grammars, like digits, boolean, or number.
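For instance, a sketch of a field using the builtin boolean grammar via the type attribute (the field name and prompt are hypothetical):

```xml
<field name="wants_receipt" type="boolean">
 <prompt>Would you like a receipt?</prompt>
</field>
```

The caller can answer with an affirmative or negative phrase, and the field variable is filled with the corresponding boolean value; no explicit <grammar> element is needed.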
Each field can have one or more prompts. If there is one, it is repeatedly used to prompt the user for the value until one is provided. If there are many, prompts are selected for playback according to the prompt selection algorithm (see Section 4.1.6). The count attribute can be used to determine which prompts to use on each attempt. In the example, prompts become shorter. This is called tapered prompting.
The <catch event="help"> elements are event handlers that define what to do when the user asks for help. Help messages can also be tapered. These can be abbreviated, so that the following two elements are equivalent:
<catch event="help">
  Please say visa, mastercard, or amex.
</catch>

<help>
  Please say visa, mastercard, or amex.
</help>
The <filled> element defines what to do when the user provides a recognized input for that field. One use is to specify integrity constraints over and above the checking done by the grammars, as with the date field above.
The last section talked about forms implementing rigid, computer-directed conversations. To make a form mixed initiative, where both the computer and the human direct the conversation, it must have one or more form-level grammars. The dialog may be written in several ways. One common authoring style combines an <initial> element that prompts for a general response with <field> elements that prompt for specific information. This is illustrated in the example below. More complex techniques, such as using the 'cond' attribute on <field> elements, may achieve a similar effect.
If a form has form-level grammars:
Its input items can be filled in any order.
More than one input item can be filled as a result of a single user utterance.

Only input items (and not control items) can be filled as a result of matching a form-level grammar. The filling of field variables when using a form-level grammar is described in Section 3.1.6.

Also, the form's grammars can be active when the user is in other dialogs. If a document has two forms on it, say a car rental form and a hotel reservation form, and both forms have grammars that are active for that document, a user could respond to a request for hotel reservation information with information about the car rental, and thus direct the computer to talk about the car rental instead. The user can speak to any active grammar, and have input items set and actions taken in response.
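A sketch of that arrangement (the form ids, field names, and grammar URIs are hypothetical): two forms in one document, each with a document-scoped form grammar, so that either form's grammar can be matched while the user is in the other form:

```xml
<form id="car_rental" scope="document">
 <grammar type="application/srgs+xml" src="/grammars/car.grxml"/>
 <field name="car_class">
  <prompt>What class of car would you like?</prompt>
 </field>
</form>

<form id="hotel" scope="document">
 <grammar type="application/srgs+xml" src="/grammars/hotel.grxml"/>
 <field name="room_type">
  <prompt>What type of room would you like?</prompt>
 </field>
</form>
```

While being prompted for a room type, a caller whose utterance matches the car rental grammar is taken to the car_rental form instead.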
Example. Here is a second version of the weather information service, showing mixed initiative. It has been "enhanced" for illustrative purposes with advertising and with a confirmation of the city and state:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form>
  <grammar src="cityandstate.grxml" type="application/srgs+xml"/>

  <!-- Caller can't barge in on today's advertisement. -->
  <block>
   <prompt bargein="false">
     Welcome to the weather information service.
     <audio src="http://www.online-ads.example.com/wis.wav"/>
   </prompt>
  </block>

  <initial name="start">
   <prompt>
     For what city and state would you like the weather?
   </prompt>
   <help>
     Please say the name of the city and state for which
     you would like a weather report.
   </help>
   <!-- If user is silent, reprompt once, then
      try directed prompts. -->
   <noinput count="1"> <reprompt/> </noinput>
   <noinput count="2">
    <reprompt/>
    <assign name="start" expr="true"/>
   </noinput>
  </initial>

  <field name="state">
   <prompt>What state?</prompt>
   <help>
     Please speak the state for which you want the weather.
   </help>
  </field>

  <field name="city">
   <prompt>Please say the city in <value expr="state"/>
     for which you want the weather.</prompt>
   <help>Please speak the city for which you want the weather.</help>
   <filled>
    <!-- Most of our customers are in LA. -->
    <if cond="city == 'Los Angeles' && state == undefined">
     <assign name="state" expr="'California'"/>
    </if>
   </filled>
  </field>

  <field name="go_ahead" modal="true">
   <grammar type="application/srgs+xml" src="/grammars/boolean.grxml"/>
   <prompt>Do you want to hear the weather for
     <value expr="city"/>, <value expr="state"/>?</prompt>
   <filled>
    <if cond="go_ahead">
     <prompt bargein="false">
      <audio src="http://www.online-ads.example.com/wis2.wav"/>
     </prompt>
     <submit next="/servlet/weather" namelist="city state"/>
    </if>
    <clear namelist="start city state go_ahead"/>
   </filled>
  </field>
 </form>
</vxml>
Here is a transcript showing the advantages for even a novice user:

C: Welcome to the weather information service. Buy Joe's Spicy Shrimp Sauce.
C: For what city and state would you like the weather?
H: Uh, California.
C: Please say the city in California for which you want the weather.
H: San Francisco, please.
C: Do you want to hear the weather for San Francisco, California?
H: No
C: For what city and state would you like the weather?
H: Los Angeles.
C: Do you want to hear the weather for Los Angeles, California?
H: Yes
C: Don't forget, buy Joe's Spicy Shrimp Sauce tonight!

C: Mostly sunny today with highs in the 80s. Lows tonight from the low 60s ...
The go_ahead field has its modal attribute set to true. This causes all grammars to be disabled except the ones defined in the current form item, so that the only grammar active during this field is the grammar for boolean.

An experienced user can get things done much faster (but is still forced to listen to the ads):

C: Welcome to the weather information service. Buy Joe's Spicy Shrimp Sauce.
C: What ...
H (barging in): LA
C: Do you ...
H (barging in): Yes

C: Don't forget, buy Joe's Spicy Shrimp Sauce tonight!

C: Mostly sunny today with highs in the 80s. Lows tonight from the low 60s ...
The form interpretation algorithm can be customized in several ways. One way is to assign a value to a form item variable, so that its form item will not be selected. Another is to use <clear> to set a form item variable to undefined; this forces the FIA to revisit the form item again.
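As a sketch of the <clear> technique (the field name, grammar URI, and limit are hypothetical), a <filled> action can reject a value and clear the variable, so the default guard condition fails again and the FIA revisits the field:

```xml
<field name="quantity">
 <grammar type="application/srgs+xml" src="/grammars/number.grxml"/>
 <prompt>How many tickets would you like?</prompt>
 <filled>
  <if cond="quantity > 10">
    Sorry, you may order at most ten tickets.
    <!-- Reset to undefined so this field is selected again. -->
    <clear namelist="quantity"/>
  </if>
 </filled>
</field>
```

This is the same pattern used to re-collect the card number in the credit card example above.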
Another method is to explicitly specify the next form item to visit using <goto nextitem>. This forces an immediate transfer to that form item even if any cond attribute present evaluates to "false". No variables, conditions or counters in the targeted form item will be reset. The form item's prompt will be played even if it has already been visited. If the <goto nextitem> occurs in a <filled> action, the rest of the <filled> action and any pending <filled> actions will be skipped.

Here is an example <goto nextitem> executed in response to the exit event:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <link event="exit">
  <grammar type="application/srgs+xml" src="/grammars/exit.grxml"/>
 </link>

 <form>
  <catch event="exit">
   <reprompt/>
   <goto nextitem="confirm_exit"/>
  </catch>

  <block>
   <prompt>
     Hello, you have been called at random to answer
     questions critical to U.S. foreign policy.
   </prompt>
  </block>

  <field name="q1">
   <grammar type="application/srgs+xml" src="/grammars/boolean.grxml"/>
   <prompt>Do you agree with the IMF position on
     privatizing certain functions of Burkina Faso's
     agriculture ministry?</prompt>
  </field>

  <field name="q2">
   <grammar type="application/srgs+xml" src="/grammars/boolean.grxml"/>
   <prompt>If this privatization occurs, will its effects
     be beneficial mainly to Ouagadougou and
     Bobo-Dioulasso?</prompt>
  </field>

  <field name="q3">
   <grammar type="application/srgs+xml" src="/grammars/boolean.grxml"/>
   <prompt>Do you agree that sorghum and millet output
     might thereby increase by as much as four percent
     per annum?</prompt>
  </field>

  <block>
   <submit next="register" namelist="q1 q2 q3"/>
  </block>

  <field name="confirm_exit">
   <grammar type="application/srgs+xml" src="/grammars/boolean.grxml"/>
   <prompt>You have elected to exit. Are you sure you
     want to do this, and perhaps adversely affect
     U.S. foreign policy vis-a-vis sub-Saharan Africa
     for decades to come?</prompt>
   <filled>
    <if cond="confirm_exit">
      Okay, but the U.S. State Department is displeased.
      <exit/>
    <else/>
      Good, let's pick up where we left off.
      <clear namelist="confirm_exit"/>
    </if>
   </filled>
   <catch event="noinput nomatch">
    <throw event="exit"/>
   </catch>
  </field>
 </form>
</vxml>
If the user says "exit" in response to any of the survey questions, an exit event is thrown by the platform and caught by the <catch> event handler. This handler directs that confirm_exit be the next visited field. The confirm_exit field would not be visited during normal completion of the survey because the preceding <block> element transfers control to the registration script.

We've presented the form interpretation algorithm (FIA) at a conceptual level. In this section we describe it in more detail. A more formal description is provided in Appendix C.

Whenever a form is entered, it is initialized. Internal prompt counter variables (in the form's dialog scope) are reset to 1. Each variable (form-level <var> elements and form item variables) is initialized, in document order, to undefined or to the value of the relevant expr attribute.
The main loop of the FIA has three phases:
The select phase: the next unfilled form item is selected for visiting.

The collect phase: the selected form item is visited, which prompts the user for input, enables the appropriate grammars, and then waits for and collects an input (such as a spoken phrase or DTMF key presses) or an event (such as a request for help or a no input timeout).

The process phase: an input is processed by filling form items and executing <filled> elements to perform actions such as input validation. An event is processed by executing the appropriate event handler for that event type.

Note that the FIA may be given an input (a set of grammar slot/slot value pairs) that was collected while the user was in a different form's FIA. In this case the first iteration of the main loop skips the select and collect phases, and goes right to the process phase with that input. Also note that if an error occurs in the select or collect phase that causes an event to be generated, the event is thrown and the FIA moves directly into the process phase.
The purpose of the select phase is to select the next form item to visit. This is done as follows:

If a <goto nextitem> was specified in the last main loop iteration's process phase, then the specified form item is selected.

Otherwise the first form item whose guard condition is false is chosen to be visited. If an error occurs while checking guard conditions, the event is thrown, which skips the collect phase, and is handled in the process phase.

If no guard condition is false, and the last iteration completed the form without encountering an explicit transfer of control, the FIA does an implicit <exit> operation (similarly, if execution proceeds outside of a form, such as when an error is generated outside of a form, and there is no explicit transfer of control, the interpreter will perform an implicit <exit> operation).
The purpose of the collect phase is to collect an input or an event. The selected form item is visited, which performs actions that depend on the type of form item:

If a <field> or <record> is visited, the FIA selects and queues up any prompts based on the item's prompt counter and the prompt conditions. Then it activates and listens for the field level grammar(s) and any active higher-level grammars, and waits for the item to be filled or for some event to be generated.

If a <transfer> is visited, the prompts are queued based on the item's prompt counter and the prompt conditions. The item grammars are activated. The queue is played before the transfer is executed.

If a <subdialog> or <object> is visited, the prompts are queued based on the item's prompt counter and the prompt conditions. Grammars are not activated. Instead, the input collection behavior is specified by the executing context for the subdialog or object. The queue is not played before the subdialog or object is executed, but instead should be played during the subsequent input collection.

If an <initial> is visited, the FIA selects and queues up prompts based on the <initial>'s prompt counter and prompt conditions. Then it listens for the form level grammar(s) and any active higher-level grammars. It waits for a grammar recognition or for an event.

A <block> element is visited by setting its form item variable to true, evaluating its content, and then bypassing the process phase. No input is collected, and the next iteration of the FIA's main loop is entered.
The purpose of the process phase is to process the input or event collected during the previous phases, as follows:
If an input matches a grammar in this form, then:
After completion of the process phase, interpretation continues by returning to the select phase.

A more detailed form interpretation algorithm can be found in Appendix C.

A menu is a convenient syntactic shorthand for a form containing a single anonymous field that prompts the user to make a choice and transitions to different places based on that choice. Like a regular form, it can have its grammar scoped such that it is active when the user is executing another dialog. The following menu offers the user three choices:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <menu>
  <prompt>
    Welcome home. Say one of: <enumerate/>
  </prompt>
  <choice next="http://www.sports.example.com/vxml/start.vxml">
    Sports
  </choice>
  <choice next="http://www.weather.example.com/intro.vxml">
    Weather
  </choice>
  <choice next="http://www.stargazer.example.com/voice/astronews.vxml">
    Stargazer astrophysics news
  </choice>
  <noinput>Please say one of <enumerate/></noinput>
 </menu>
</vxml>
This dialog might proceed as follows:
C: Welcome home. Say one of: sports; weather; Stargazer astrophysics news.
H: Astrology.
C: I did not understand what you said. (a platform-specific default message.)

C: Welcome home. Say one of: sports; weather; Stargazer astrophysics news.
H: sports.
C: (proceeds to http://www.sports.example.com/vxml/start.vxml)
The <menu> element identifies the menu and determines the scope of its grammars. The menu element's attributes are:
id | The identifier of the menu. It allows the menu to be the target of a <goto> or a <submit>. |
---|---|
scope | The menu's grammar scope. If it is dialog (the default), the menu's grammars are only active when the user transitions into the menu. If the scope is document, its grammars are active over the whole document (or if the menu is in the application root document, any loaded document in the application). |
dtmf | When set to true, the first nine choices that have not explicitly specified a value for the dtmf attribute are given the implicit ones "1", "2", etc. Remaining choices that have not explicitly specified a value for the dtmf attribute will not be assigned DTMF values (and thus cannot be matched via a DTMF keypress). If there are choices which have specified their own DTMF sequences to be something other than "*", "#", or "0", an error.badfetch will be thrown. The default is false. |
accept | When set to "exact" (the default), the text of the choice elements in the menu defines the exact phrase to be recognized. When set to "approximate", the text of the choice elements defines an approximate recognition phrase (as described under Section 2.2.5). Each <choice> can override this setting. |
The <choice> element serves several purposes:
It may specify a speech grammar, defined either using a <grammar> element or automatically generated by the process described in Section 2.2.5.
It may specify a DTMF grammar, as discussed in Section 2.2.3.
The contents may be used to form the <enumerate> prompt string. This is described in Section 2.2.4.
It specifies either an event to be thrown or the URI to go to when the choice is selected.
The choice element's attributes are:
Attribute | Description
---|---
dtmf | The DTMF sequence for this choice. It is equivalent to a simple DTMF <grammar> and DTMF properties (Section 6.3.3) apply to recognition of the sequence. Unlike DTMF grammars, whitespace is optional: dtmf="123#" is equivalent to dtmf="1 2 3 #".
accept | Overrides the setting for accept in <menu> for this particular choice. When set to "exact" (the default), the text of the choice element defines the exact phrase to be recognized. When set to "approximate", the text of the choice element defines an approximate recognition phrase (as described under Section 2.2.5).
next | The URI of the next dialog or document.
expr | Specifies an expression to evaluate as a URI to transition to instead of specifying a next.
event | Specifies an event to be thrown instead of specifying a next.
eventexpr | An ECMAScript expression evaluating to the name of the event to be thrown.
message | A message string providing additional context about the event being thrown. The message is available as the value of a variable within the scope of the catch element; see Section 5.2.2.
messageexpr | An ECMAScript expression evaluating to the message string.
fetchaudio | See Section 6.1. This defaults to the fetchaudio property.
fetchhint | See Section 6.1. This defaults to the documentfetchhint property.
fetchtimeout | See Section 6.1. This defaults to the fetchtimeout property.
maxage | See Section 6.1. This defaults to the documentmaxage property.
maxstale | See Section 6.1. This defaults to the documentmaxstale property.
Exactly one of "next", "expr", "event" or "eventexpr" must be specified; otherwise, an error.badfetch event is thrown. Exactly one of "message" or "messageexpr" may be specified; otherwise, an error.badfetch event is thrown.
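For example, a choice can throw an event instead of transitioning to a URI. The following is an illustrative sketch (the prompt text and catch handler are not taken from the specification's examples):

```xml
<menu>
 <prompt> Say sports, weather, or help. </prompt>
 <choice next="http://www.sports.example.com/vxml/start.vxml">
   sports
 </choice>
 <choice next="http://www.weather.example.com/intro.vxml">
   weather
 </choice>
 <!-- throws the standard "help" event rather than transitioning -->
 <choice event="help"> help </choice>
 <catch event="help">
   Say sports or weather to choose a service.
   <reprompt/>
 </catch>
</menu>
```

Because the catch handler neither exits nor transitions, the FIA clears the menu's anonymous field variable and the menu is executed again.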
If a <grammar> element is specified in <choice>, then the specified grammar is used instead of an automatically generated grammar. This allows the developer to precisely control the <choice> grammar; for example:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <menu>
  <choice next="http://www.sports.example.com/vxml/start.vxml">
   <grammar src="sports.grxml" type="application/srgs+xml"/>
   Sports
  </choice>
  <choice next="http://www.weather.example.com/intro.vxml">
   <grammar src="weather.grxml" type="application/srgs+xml"/>
   Weather
  </choice>
  <choice next="http://www.stargazer.example.com/voice/astronews.vxml">
   <grammar src="astronews.grxml" type="application/srgs+xml"/>
   Stargazer astrophysics
  </choice>
 </menu>
</vxml>
Menus can rely purely on speech, purely on DTMF, or both in combination by including a <property> element in the <menu>. Here is a DTMF-only menu with explicit DTMF sequences given to each choice, using the choice's dtmf attribute:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <menu>
  <property name="inputmodes" value="dtmf"/>
  <prompt>
    For sports press 1, For weather press 2,
    For Stargazer astrophysics press 3.
  </prompt>
  <choice dtmf="1" next="http://www.sports.example.com/vxml/start.vxml"/>
  <choice dtmf="2" next="http://www.weather.example.com/intro.vxml"/>
  <choice dtmf="3" next="http://www.stargazer.example.com/astronews.vxml"/>
 </menu>
</vxml>
Alternatively, you can set the <menu>'s dtmf attribute to true to assign sequential DTMF digits to each of the first nine choices that have not specified their own DTMF sequences: the first choice has DTMF "1", and so on:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <menu dtmf="true">
  <property name="inputmodes" value="dtmf"/>
  <prompt>
    For sports press 1, For weather press 2,
    For Stargazer astrophysics press 3.
  </prompt>
  <choice next="http://www.sports.example.com/vxml/start.vxml"/>
  <choice next="http://www.weather.example.com/intro.vxml"/>
  <choice dtmf="0" next="#operator"/>
  <choice next="http://www.stargazer.example.com/voice/astronews.vxml"/>
 </menu>
</vxml>
The <enumerate> element is an automatically generated description of the choices available to the user. It specifies a template that is applied to each choice in the order they appear in the menu. If it is used with no content, a default template that lists all the choices is used, determined by the interpreter context. If it has content, the content is the template specifier. This specifier may refer to two special variables: _prompt is the choice's prompt, and _dtmf is a normalized representation (i.e. a single whitespace between DTMF tokens) of the choice's assigned DTMF sequence (note that if no DTMF sequence is assigned to the choice element, or if a <grammar> element is specified in <choice>, then the _dtmf variable is assigned the ECMAScript undefined value). For example, if the menu were rewritten as
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <menu dtmf="true">
  <prompt>
    Welcome home.
    <enumerate>
      For <value expr="_prompt"/>, press <value expr="_dtmf"/>.
    </enumerate>
  </prompt>
  <choice next="http://www.sports.example.com/vxml/start.vxml">
    sports
  </choice>
  <choice next="http://www.weather.example.com/intro.vxml">
    weather
  </choice>
  <choice next="http://www.stargazer.example.com/voice/astronews.vxml">
    Stargazer astrophysics news
  </choice>
 </menu>
</vxml>
then the menu's prompt would be:
C: Welcome home. For sports, press 1. For weather, press 2. For Stargazer astrophysics news, press 3.
The <enumerate> element may be used within the prompts and the catch elements associated with <menu> elements and with <field> elements that contain <option> elements, as discussed in Section 2.3.1.3. An error.semantic event is thrown if <enumerate> is used elsewhere (for example, <enumerate> within an <enumerate>).
A choice phrase specifies a set of words and phrases to listen for. A choice phrase is constructed from the PCDATA of the elements contained directly or indirectly in a <choice> element of a <menu>, or in the <option> element of a <field>.
If the accept attribute is "exact", then the user must say the entire phrase, with the words in the same order in which they occur in the choice phrase.
If the accept attribute is "approximate", then the choice may be matched when a user says a subphrase of the expression. For example, in response to the prompt "Stargazer astrophysics news" a user could say "Stargazer", "astrophysics", "Stargazer news", "astrophysics news", and so on. The equivalent grammar may be language and platform dependent.
As an example of using "exact" and "approximate" in different choices, consider this example:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <menu accept="approximate">
  <choice next="http://www.stargazer.example.com/voice/astronews.vxml">
    Stargazer Astrophysics News
  </choice>
  <choice accept="exact" next="http://www.physicsweekly.com/voice/example.vxml">
    Physics Weekly
  </choice>
  <choice accept="exact" next="http://www.particlephysics.com/voice/example.vxml">
    Particle Physics Update
  </choice>
  <choice next="http://www.astronomytoday.com/voice/example.vxml">
    Astronomy Today
  </choice>
 </menu>
</vxml>
Because "approximate" is specified for the first choice, the user may say a subphrase when matching the first choice; for instance, "Stargazer" or "Astrophysics News". However, because "exact" is specified in the second and third choices, only a complete phrase will match: "Physics Weekly" and "Particle Physics Update".
A menu behaves like a form with a single field that does all the work. The menu prompts become field prompts. The menu event handlers become the field event handlers. The menu grammars become form grammars. As with forms, grammar matches in menu will update the application.lastresult$ array. These variables are described in Section 5.1.5. Generated grammars must always produce simple results whose interpretation and utterance values are identical.
Upon entry, the menu's grammars are built and enabled, and the prompt is played. When the user input matches a choice, control transitions according to the value of the next, expr, event or eventexpr attribute of the <choice>, only one of which may be specified. If an event attribute is specified but its event handler does not cause the interpreter to exit or transition control, then the FIA will clear the form item variable of the menu's anonymous field, causing the menu to be executed again.
A form item is an element of a <form> that can be visited during form interpretation. These elements are <field>, <block>, <initial>, <subdialog>, <object>, <record>, and <transfer>.
All form items have the following characteristics:
They have a result variable, specified by the name attribute. This variable may be given an initial value with the expr attribute.
They have a guard condition specified with the cond attribute. A form item is visited if it is not filled and its cond is not specified or evaluates, after conversion to boolean, to true.
Form items are subdivided into input items, those that define the form's input item variables, and control items, those that help control the gathering of the form's input items. Input items (<field>, <subdialog>, <object>, <record>, and <transfer>) generally may contain the following elements:
<filled> elements containing some action to execute after the result input item variable is filled in.
<property> elements to specify properties that are in effect for this input item (the <initial> form item can also contain this element).
<prompt> elements to specify prompts to be played when this element is visited.
<grammar> elements to specify allowable spoken and character input for this input item (<subdialog> and <object> cannot contain this element).
<catch> elements and catch shorthands that are in effect for this input item (the <initial> form item can also contain this element).
Each input item may have an associated set of shadow variables. Shadow variables are used to return results from the execution of an input item, other than the value stored under the name attribute. For example, it may be useful to know the confidence level that was obtained as a result of a recognized grammar in a <field> element. A shadow variable is referenced as name$.shadowvar, where name is the value of the form item's name attribute, and shadowvar is the name of a specific shadow variable. Shadow variables are read-only and are not modified by the application. For example, the <field> element returns a shadow variable confidence. The example below illustrates how this shadow variable is accessed.
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form>
  <field name="state">
   <prompt> Please say the name of a state. </prompt>
   <grammar src="http://mygrammars.example.com/states.gram"
     type="application/srgs"/>
   <filled>
    <if cond="state$.confidence &lt; 0.4">
     <throw event="nomatch"/>
    </if>
   </filled>
  </field>
 </form>
</vxml>
In the example, the confidence of the result is examined, and the result is rejected if the confidence is too low.
A field specifies an input item to be gathered from the user. The field element's attributes are:
Attribute | Description
---|---
name | The form item variable in the dialog scope that will hold the result. The name must be unique among form items in the form. If the name is not unique, then an error.badfetch is thrown when the document is fetched. The name must conform to the variable naming conventions in Section 5.1.
expr | The initial value of the form item variable; default is ECMAScript undefined. If initialized to a value, then the form item will not be visited unless the form item variable is cleared.
cond | An expression that must evaluate to true after conversion to boolean in order for the form item to be visited. The form item can also be visited if the attribute is not specified.
type | The type of field, i.e., the name of a builtin grammar type (see Appendix P). Platform support for builtin grammar types is optional. If the specified builtin type is not supported by the platform, an error.unsupported.builtin event is thrown.
slot | The name of the grammar slot used to populate the variable (if it is absent, it defaults to the variable name). This attribute is useful in the case where the grammar format being used has a mechanism for returning sets of slot/value pairs and the slot names differ from the form item variable names.
modal | If this is false (the default), all active grammars are turned on while collecting this field. If this is true, then only the field's grammars are enabled: all others are temporarily disabled.
The shadow variables of a <field> element with the name name are given in Table 10. The values of the utterance, inputmode and interpretation shadow variables must be the same as those in application.lastresult$ (see Section 5.1.5).
Variable | Description
---|---
name$.utterance | The raw string of words that were recognized. The exact tokenization and spelling is platform-specific (e.g. "five hundred thirty" or "5 hundred 30" or even "530"). In the case of a DTMF grammar, this variable will contain the matched digit string.
name$.inputmode | The mode in which user input was provided: dtmf or voice.
name$.interpretation | An ECMAScript variable containing the interpretation as described in Section 3.1.5.
name$.confidence | The confidence level for the name field, ranging from 0.0 to 1.0. A value of 0.0 indicates minimum confidence, and a value of 1.0 indicates maximum confidence. A platform may use the utterance confidence (the value of application.lastresult$.confidence) as the value of name$.confidence. This distinction between field-level and utterance-level confidence is platform-dependent. More specific interpretation of a confidence value is platform-dependent since its computation is likely to differ between platforms.
Explicit grammars can be specified via a URI, which can be absolute or relative:
<field name="flavor">
 <prompt>What is your favorite ice cream?</prompt>
 <grammar src="../grammars/ice_cream.grxml"
   type="application/srgs+xml"/>
</field>
Grammars can be specified inline, for example using a W3C ABNF grammar:
<field name="flavor">
 <prompt>What is your favorite flavor?</prompt>
 <help>Say one of vanilla, chocolate, or strawberry.</help>
 <grammar mode="voice" type="application/srgs">
   #ABNF 1.0;
   root $options;
   $options = vanilla | chocolate | strawberry;
 </grammar>
</field>
If both the <grammar> src attribute and an inline grammar are specified, then an error.badfetch is thrown.
Platform support for builtin resources such as speech grammars, DTMF grammars and audio files is optional. These resources are accessed using platform-specific URIs, such as "http://localhost:5000/grammar/boolean", or platform-specific schemes such as the commonly used 'builtin' scheme, "builtin:grammar/boolean".
If a platform supports access to builtin resources, then it should support access to fundamental builtin grammars (see Appendix P); for example
<grammar src="builtin:grammar/boolean"/>
<grammar src="builtin:dtmf/boolean"/>
where the first <grammar> references the builtin boolean speech grammar, and the second references the builtin boolean DTMF grammar.
By definition the following:
<field type="sample">
 <prompt>Prompt for builtin grammar</prompt>
</field>
is equivalent to the following platform-specific builtin grammars:
<field>
 <grammar src="builtin:grammar/sample"/>
 <grammar src="builtin:dtmf/sample"/>
 <prompt>Prompt for builtin grammar</prompt>
</field>
where sample is one of the fundamental builtin field types (e.g., boolean, date, etc.).
In addition, platform-specific builtin URI schemes may be used to access grammars that are supported by particular interpreter contexts. It is recommended that platform-specific builtin grammar names begin with the string "x-", as this namespace will not be used in future versions of the standard.
Examples of platform-specific builtin grammars:
<grammar src="builtin:grammar/x-sample"/>
<grammar src="builtin:dtmf/x-sample"/>
When a simple set of alternatives is all that is needed to specify the legal input values for a field, it may be more convenient to use an option list than a grammar. An option list is represented by a set of <option> elements contained in a <field> element. Each <option> element contains PCDATA that is used to generate a speech grammar. This follows the grammar generation method described for <choice> in Section 2.2.5. Attributes may be used to specify a DTMF sequence for each option and to control the value assigned to the field's form item variable. When an option is chosen, the value attribute determines the interpretation value for the field's shadow variable and for application.lastresult$.
The following field offers the user three choices and assigns the value of the value attribute of the selected option to the maincourse variable:
<field name="maincourse">
 <prompt>
   Please select an entree. Today, we are featuring <enumerate/>
 </prompt>
 <option dtmf="1" value="fish"> swordfish </option>
 <option dtmf="2" value="beef"> roast beef </option>
 <option dtmf="3" value="chicken"> frog legs </option>
 <filled>
   <submit next="/cgi-bin/maincourse.cgi" method="post"
     namelist="maincourse"/>
 </filled>
</field>
This conversation might sound like:
C: Please select an entree. Today, we're featuring swordfish; roast beef; frog legs.
H: frog legs
C: (assigns "chicken" to "maincourse", then submits "maincourse=chicken" to /cgi-bin/maincourse.cgi)
The following example shows proper and improper use of <enumerate> in a catch element of a form with several fields containing <option> elements:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form>
  <block>
    We need a few more details to complete your order.
  </block>
  <field name="color">
   <prompt>Which color?</prompt>
   <option>red</option>
   <option>blue</option>
   <option>green</option>
  </field>
  <field name="size">
   <prompt>Which size?</prompt>
   <option>small</option>
   <option>medium</option>
   <option>large</option>
  </field>
  <field name="quantity">
   <grammar type="application/srgs+xml" src="/grammars/number.grxml"/>
   <prompt>How many?</prompt>
  </field>
  <block>
    Thank you. Your order is being processed.
    <submit next="details.cgi" namelist="color size quantity"/>
  </block>
  <catch event="help nomatch">
    Your options are <enumerate/>.
  </catch>
 </form>
</vxml>
A scenario might be:
C: We need a few more details to complete your order. Which color?
H: help. (throws "help" event caught by form-level <catch>)
C: Your options are red, blue, green.
H: red.
C: Which size?
H: 7 (throws "nomatch" event caught by form-level <catch>)
C: Your options are small, medium, large.
H: small.
In the steps above, the <enumerate/> in the form-level catch had something to enumerate: the <option> elements in the "color" and "size" <field> elements. The next <field>, however, is different:
C: How many?
H: a lot. (throws "nomatch" event caught by form-level <catch>)
The form-level <catch>'s use of <enumerate> causes an "error.semantic" event to be thrown because the "quantity" <field> does not contain any <option> elements that can be enumerated.
One solution is to add a field-level <catch> to the"quantity" <field>:
<catch event="help nomatch">
  Please say the number of items to be ordered.
</catch>
The "nomatch" event would then be caught locally, resulting in the following possible completion of the scenario:
C: Please say the number of items to be ordered.
H: 50
C: Thank you. Your order is being processed.
The <enumerate> element is also discussed in Section 2.2.4.
The attributes of <option> are:
Attribute | Description
---|---
dtmf | An optional DTMF sequence for this option. It is equivalent to a simple DTMF <grammar> and DTMF properties (Section 6.3.3) apply to recognition of the sequence. Unlike DTMF grammars, whitespace is optional: dtmf="123#" is equivalent to dtmf="1 2 3 #". If unspecified, no DTMF sequence is associated with this option so it cannot be matched using DTMF.
accept | When set to "exact" (the default), the text of the option element defines the exact phrase to be recognized. When set to "approximate", the text of the option element defines an approximate recognition phrase (as described in Section 2.2.5).
value | The string to assign to the field's form item variable when a user selects this option, whether by speech or DTMF. The default assignment is the CDATA content of the <option> element with leading and trailing white space removed. If this does not exist, then the DTMF sequence is used instead. If neither CDATA content nor a dtmf sequence is specified, then the default assignment is undefined and the field's form item variable is not filled.
The use of <option> does not preclude the simultaneous use of <grammar>. The result would be the match from either grammar, not unlike the occurrence of two <grammar> elements in the same <field> representing a disjunction of choices.
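A sketch of this combination follows (the field name, options, and grammar URI are illustrative, not from the specification); input may match either the option list or the external grammar:

```xml
<field name="topping">
 <prompt>Which topping would you like?</prompt>
 <option dtmf="1">pepperoni</option>
 <option dtmf="2">mushrooms</option>
 <!-- hypothetical grammar covering additional phrasings; a match
      from either this grammar or the option list fills "topping" -->
 <grammar src="toppings.grxml" type="application/srgs+xml"/>
</field>
```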
This element is a form item. It contains executable content that is executed if the block's form item variable is undefined and the block's cond attribute, if any, evaluates to true.
<block>
  Welcome to Flamingo, your source for lawn ornaments.
</block>
The form item variable is automatically set to true just before the block is entered. Therefore, blocks are typically executed just once per form invocation.
Sometimes you may need more control over blocks. To do this, you can name the form item variable, and set or clear it to control execution of the <block>. This variable is declared in the dialog scope of the form.
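For instance, the following sketch names the block's form item variable and clears it to replay the greeting (the field and grammar URI are illustrative, not from the specification):

```xml
<form>
 <!-- naming the form item variable makes the block controllable -->
 <block name="welcome">
   Welcome to Flamingo, your source for lawn ornaments.
 </block>
 <field name="item">
  <grammar src="items.grxml" type="application/srgs+xml"/>
  <prompt>What would you like to order?</prompt>
  <nomatch>
    <!-- clearing "welcome" makes the block eligible again, so the
         greeting is replayed on the next FIA iteration -->
    <clear namelist="welcome"/>
  </nomatch>
 </field>
</form>
```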
Attributes of <block> include:
Attribute | Description
---|---
name | The name of the form item variable used to track whether this block is eligible to be executed; defaults to an inaccessible internal variable.
expr | The initial value of the form item variable; default is ECMAScript undefined. If initialized to a value, then the form item will not be visited unless the form item variable is cleared.
cond | An expression that must evaluate to true after conversion to boolean in order for the form item to be visited.
In a typical mixed initiative form, an <initial> element is visited when the user is initially being prompted for form-wide information, and has not yet entered into the directed mode where each field is visited individually. Like input items, it has prompts, catches, and event counters. Unlike input items, <initial> has no grammars, and no <filled> action. For instance:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form>
  <grammar src="http://www.directions.example.com/grammars/from_to.grxml"
    type="application/srgs+xml"/>
  <block>
    Welcome to the Driving Directions By Phone.
  </block>
  <initial name="bypass_init">
   <prompt>
     Where do you want to drive from and to?
   </prompt>
   <nomatch count="1">
     Please say something like "from Atlanta Georgia to Toledo Ohio".
   </nomatch>
   <nomatch count="2">
     I'm sorry, I still don't understand.
     I'll ask you for information one piece at a time.
     <assign name="bypass_init" expr="true"/>
     <reprompt/>
   </nomatch>
  </initial>
  <field name="from_city">
   <grammar src="http://www.directions.example.com/grammars/city.grxml"
     type="application/srgs+xml"/>
   <prompt>From which city are you leaving?</prompt>
  </field>
  <field name="to_city">
   <grammar src="http://www.directions.example.com/grammars/city.grxml"
     type="application/srgs+xml"/>
   <prompt>Which city are you going to?</prompt>
  </field>
 </form>
</vxml>
If an event occurs while visiting an <initial>, then one of its event handlers executes. As with other form items, <initial> continues to be eligible to be visited while its form item variable is undefined and while its cond attribute is true. If one or more of the input item variables is set by user input, then all <initial> form item variables are set to true, before any <filled> actions are executed.
An <initial> form item variable can be manipulated explicitly to disable, or re-enable, the <initial>'s eligibility to the FIA. For example, in the program above, the <initial>'s form item variable is set on the second nomatch event. This causes the FIA to no longer consider the <initial> and to choose the next form item, which is a <field> to prompt explicitly for the origination city. Similarly, an <initial>'s form item variable could be cleared, so that <initial> gets selected again by the FIA.
More than one <initial> may be specified in the same form. When the form is entered, only the first <initial> in document order that is eligible according to its cond attribute will be visited. After the first form item variable is filled, all <initial> form item variables are set to true so that they are not visited. Explicitly clearing the <initial>s can allow them to be reused, and even allow a different <initial> to be selected on subsequent iterations of the FIA.
The cond attribute can also be used to select which <initial> to use in a given iteration. An application could provide multiple <initial>s but mark them for use only under special circumstances by using their cond attribute; for example, if the cond attribute were used to test for novice versus advanced operation mode, and only use the <initial>s in advanced mode. Furthermore, if the first <initial> in document order specified a value for its cond attribute which was never fulfilled, then it would never be executed. If all <initial>s had cond values which prevented their selection, then none would be executed.
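A sketch of a form with two <initial>s selected by cond (the "advanced" variable and grammar URIs are illustrative, not taken from the specification):

```xml
<form>
 <var name="advanced" expr="false"/>
 <grammar src="from_to.grxml" type="application/srgs+xml"/>
 <!-- eligible only in advanced mode -->
 <initial cond="advanced">
  <prompt>Where from and where to?</prompt>
 </initial>
 <!-- eligible otherwise: a more verbose prompt for novices -->
 <initial cond="!advanced">
  <prompt>
    Please say something like
    "from Atlanta Georgia to Toledo Ohio".
  </prompt>
 </initial>
 <field name="from_city">
  <grammar src="city.grxml" type="application/srgs+xml"/>
  <prompt>From which city are you leaving?</prompt>
 </field>
</form>
```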
Normal grammar scoping rules apply when visiting an <initial>, as described in Section 3.1.3. In particular, no grammars scoped to a <field> are active.
Note: explicit assignment of values to input item variables does not affect the value of an <initial>'s form item variable.
Attributes of <initial> include:
Attribute | Description
---|---
name | The name of a form item variable used to track whether the <initial> is eligible to execute; defaults to an inaccessible internal variable.
expr | The initial value of the form item variable; default is ECMAScript undefined. If initialized to a value, then the form item will not be visited unless the form item variable is cleared.
cond | An expression that must evaluate to true after conversion to boolean in order for the form item to be visited.
Subdialogs are a mechanism for reusing common dialogs and building libraries of reusable applications.
The <subdialog> element invokes a 'called' dialog (known as the subdialog) identified by its src or srcexpr attribute in the 'calling' dialog. The subdialog executes in a new execution context that includes all the declarations and state information for the subdialog, the subdialog's document, and the subdialog's application root (if present), with counters reset and variables initialized. The subdialog proceeds until the execution of a <return> or <exit> element, or until no form items remain eligible for the FIA to select (equivalent to an <exit>). A <return> element causes control and data to be returned to the calling dialog (Section 5.3.10). When the subdialog returns, its execution context is deleted, and execution resumes in the calling dialog with any appropriate <filled> elements.
The subdialog context and the context of the calling dialog are independent, even if the dialogs are in the same document. Variables in the scope chain of the calling dialog are not shared with the called subdialog: there is no sharing of variable instances between execution contexts. Even when the subdialog is specified in the same document as the calling dialog, its execution context contains different variable instances. When the subdialog and calling dialog are in different documents but share a root document, the subdialog's root variables are likewise different instances. All variable bindings applied in the subdialog context are lost on return to the calling context.
Within the subdialog context, however, normal scoping rules for grammars, events and variables apply. Active grammars in a subdialog include default grammars defined by the interpreter context and appropriately scoped grammars in <link>, <menu> and <form> elements in the subdialog's document and its root document. Event handling and variable binding likewise follow the standard scoping hierarchy.
From a programming perspective, subdialogs behave differently from subroutines because the calling and called contexts are independent. While a subroutine can access variable instances in its calling routine, a subdialog cannot access the same variable instance defined in its calling dialog. Similarly, subdialogs do not follow the event percolation model in languages like Java, where an event thrown in a method automatically percolates up to the calling context if not handled in the called context. Events thrown in a subdialog are treated by event handlers defined within its context; they can only be passed to the calling context by a local event handler which explicitly returns the event to the calling context (see Section 5.3.10).
The subdialog is specified by the URI reference in the <subdialog>'s src or srcexpr attribute (see [RFC2396]). If this URI reference contains an absolute or relative URI, which may include a query string, then that URI is fetched and the subdialog is found in the resulting document. If the <subdialog> has a namelist attribute, then those variables are added to the query string of the URI.
If the URI reference contains only a fragment (i.e., no absolute or relative URI), and if there is no namelist attribute, then there is no fetch: the subdialog is found in the current document.
The URI reference's fragment, if any, specifies the subdialog to invoke. When there is no fragment, the subdialog invoked is the lexically first dialog in the document.
If the URI reference is not valid (i.e. the dialog or document does not exist), an error.badfetch must be thrown. Note that for errors which occur during a dialog or document transition, the scope in which errors are handled is platform-specific.
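A minimal sketch of a fragment-only reference, which resolves within the current document without a fetch (the form ids, field, and grammar URI are illustrative):

```xml
<form id="main">
 <!-- "#getssn" contains only a fragment: no fetch occurs -->
 <subdialog name="result" src="#getssn">
  <filled>
   <prompt>Thanks. Your number is recorded.</prompt>
  </filled>
 </subdialog>
</form>

<form id="getssn">
 <field name="ssn">
  <grammar src="ssn.grxml" type="application/srgs+xml"/>
  <prompt>Please say your social security number.</prompt>
  <filled>
   <return namelist="ssn"/>
  </filled>
 </field>
</form>
```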
The attributes are:
Attribute | Description
---|---
name | The result returned from the subdialog, an ECMAScript object whose properties are the ones defined in the namelist attribute of the <return> element.
expr | The initial value of the form item variable; default is ECMAScript undefined. If initialized to a value, then the form item will not be visited unless the form item variable is cleared.
cond | An expression that must evaluate to true after conversion to boolean in order for the form item to be visited.
namelist | The list of variables to submit. The default is to submit no variables. If a namelist is supplied, it may contain individual variable references which are submitted with the same qualification used in the namelist. Declared VoiceXML and ECMAScript variables can be referenced. If an undeclared variable is referenced in the namelist, then an error.semantic is thrown (Section 5.1.1).
src | The URI of the subdialog.
srcexpr | An ECMAScript expression yielding the URI of the subdialog.
method | See Section 5.3.8.
enctype | See Section 5.3.8.
fetchaudio | See Section 6.1. This defaults to the fetchaudio property.
fetchtimeout | See Section 6.1. This defaults to the fetchtimeout property.
fetchhint | See Section 6.1. This defaults to the documentfetchhint property.
maxage | See Section 6.1. This defaults to the documentmaxage property.
maxstale | See Section 6.1. This defaults to the documentmaxstale property.
Exactly one of "src" or "srcexpr" must be specified; otherwise, an error.badfetch event is thrown.
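For instance (the URI and the variable lang in this sketch are hypothetical), a subdialog target can be computed at runtime with srcexpr; supplying both src and srcexpr would throw error.badfetch:

```xml
<!-- Sketch: lang is assumed to be a declared variable holding a
     language code such as 'en'. srcexpr is evaluated when the
     form item is visited. -->
<subdialog name="result"
    srcexpr="'http://dialogs.example.com/' + lang + '/confirm.vxml'">
  <filled>
    <log>confirmed: <value expr="result.confirmed"/></log>
  </filled>
</subdialog>
```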
The <subdialog> element may contain elements common to all form items, and may also contain <param> elements. The <param> elements of a <subdialog> specify the parameters to pass to the subdialog. These parameters must be declared as <var> elements in the form executed as the subdialog, or an error.semantic will be thrown. When a subdialog initializes, the subdialog's form-level <var> elements are initialized in document order to the value specified by the <param> element with the corresponding name. The parameter values are computed by evaluating the <param> expr attribute in the context of the <param> element. An expr attribute on the <var> element is ignored in this case. If no corresponding <param> is specified for a <var> element, its expr attribute is used as the default value, or the variable is undefined if the expr attribute is unspecified, as with a regular <form> element.
In the example below, the birthday of an individual is used to validate their driver's license. The src attribute of the subdialog refers to a form that is within the same document. The <param> element is used to pass the birthday value to the subdialog.
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">

 <!-- form dialog that calls a subdialog -->
 <form>
  <subdialog name="result" src="#getdriverslicense">
   <param name="birthday" expr="'2000-02-10'"/>
   <filled>
    <submit next="http://myservice.example.com/cgi-bin/process"/>
   </filled>
  </subdialog>
 </form>

 <!-- subdialog to get drivers license -->
 <form id="getdriverslicense">
  <var name="birthday"/>
  <field name="drivelicense">
   <grammar src="http://grammarlib/drivegrammar.grxml"
     type="application/srgs+xml"/>
   <prompt> Please say your drivers license number. </prompt>
   <filled>
    <if cond="validdrivelicense(drivelicense,birthday)">
     <var name="status" expr="true"/>
    <else/>
     <var name="status" expr="false"/>
    </if>
    <return namelist="drivelicense status"/>
   </filled>
  </field>
 </form>
</vxml>
The driver's license value is returned to the calling dialog, along with a status variable indicating whether the license is valid or not.
This example also illustrates the convenience of using <param> to forward data to the subdialog, instantiating values in the subdialog without using server-side scripting. An alternative solution that uses scripting is shown below.
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form>
  <field name="birthday">
   <grammar type="application/srgs+xml" src="/grammars/date.grxml"/>
   What is your birthday?
  </field>
  <subdialog name="result" src="/cgi-bin/getlib#getdriverslicense"
    namelist="birthday">
   <filled>
    <submit next="http://myservice.example.com/cgi-bin/process"/>
   </filled>
  </subdialog>
 </form>
</vxml>
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form id="getdriverslicense">
  <var name="birthday" expr="'1980-02-10'"/> <!-- Generated by server script -->
  <field name="drivelicense">
   <grammar src="http://grammarlib/drivegrammar.grxml"
     type="application/srgs+xml"/>
   <prompt> Please say your drivers license number. </prompt>
   <filled>
    <if cond="validdrivelicense(drivelicense,birthday)">
     <var name="status" expr="true"/>
    <else/>
     <var name="status" expr="false"/>
    </if>
    <return namelist="drivelicense status"/>
   </filled>
  </field>
 </form>
</vxml>
In the above example, a server-side script had to generate the document and embed the birthday value.
One last example, shown below, illustrates a subdialog to capture general credit card information. First the subdialog is defined in a separate document; it is intended to be reusable across different applications. It returns a status, the credit card number, and the expiry date; if a result cannot be obtained, the status is returned with value "no_result".
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <!-- Example of subdialog to collect credit card information. -->
 <!-- file is at http://www.somedomain.example.com/ccn.vxml -->
 <form>
  <var name="status" expr="'no_result'"/>
  <field name="creditcardnum">
   <prompt> What is your credit card number? </prompt>
   <help>
    I am trying to collect your credit card information.
    <reprompt/>
   </help>
   <nomatch>
    <return namelist="status"/>
   </nomatch>
   <grammar src="ccn.grxml" type="application/srgs+xml"/>
  </field>
  <field name="expirydate">
   <grammar type="application/srgs+xml" src="/grammars/date.grxml"/>
   <prompt> What is the expiry date of this card? </prompt>
   <help>
    I am trying to collect the expiry date of the credit card
    number you provided.
    <reprompt/>
   </help>
   <nomatch>
    <return namelist="status"/>
   </nomatch>
  </field>
  <block>
   <assign name="status" expr="'result'"/>
   <return namelist="status creditcardnum expirydate"/>
  </block>
 </form>
</vxml>
An application that includes a calling dialog is shown below. It obtains the name of a software product and operating system using a mixed initiative dialog, and then solicits credit card information using the subdialog.
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <!-- Example main program -->
 <!-- http://www.somedomain.example.com/main.vxml -->
 <!-- calls subdialog ccn.vxml -->

 <!-- assume this gets defined by some dialog -->
 <var name="username"/>

 <form>
  <var name="ccn"/>
  <var name="exp"/>
  <grammar src="buysoftware.grxml" type="application/srgs+xml"/>
  <initial name="start">
   <prompt>
    Please tell us the software product you wish to buy
    and the operating system on which it must run.
   </prompt>
   <noinput>
    <assign name="start" expr="true"/>
   </noinput>
  </initial>
  <field name="product">
   <prompt> Which software product would you like to buy? </prompt>
  </field>
  <field name="operatingsystem">
   <prompt> Which operating system does this software need to run on? </prompt>
  </field>
  <subdialog name="cc_results"
    src="http://somedomain.example.com/ccn.vxml">
   <filled>
    <if cond="cc_results.status=='no_result'">
     Sorry, your credit card information could not be obtained.
     This order is cancelled.
     <exit/>
    <else/>
     <assign name="ccn" expr="cc_results.creditcardnum"/>
     <assign name="exp" expr="cc_results.expirydate"/>
    </if>
   </filled>
  </subdialog>
  <block>
   We will now process your order. Please hold.
   <submit next="http://www.somedomain.example.com/process_order.asp"
     namelist="username product operatingsystem ccn exp"/>
  </block>
 </form>
</vxml>
A VoiceXML implementation platform may expose platform-specific functionality for use by a VoiceXML application via the <object> element. The <object> element makes direct use of its own content during initialization (e.g. <param> child elements) and execution. As a result, <object> content cannot be treated as alternative content. Notice that like other input items, <object> has prompts and catch elements. It may also have <filled> actions.
For example, a platform-specific credit card collection object could be accessed like this:
<object name="debit"
    classid="method://credit-card/gather_and_debit"
    data="http://www.recordings.example.com/prompts/credit/jesse.jar">
  <param name="amount" expr="document.amt"/>
  <param name="vendor" expr="vendor_num"/>
</object>
In this example, the <param> element (Section 6.4) is used to pass parameters to the object when it is invoked. When this <object> is executed, it returns an ECMAScript object as the value of its form item variable. This <block> presents the values returned from the credit card object:
<block>
  <prompt> The card type is <value expr="debit.card"/>. </prompt>
  <prompt> The card number is <value expr="debit.card_no"/>. </prompt>
  <prompt> The expiration date is <value expr="debit.expiry_date"/>. </prompt>
  <prompt> The approval code is <value expr="debit.approval_code"/>. </prompt>
  <prompt> The confirmation number is <value expr="debit.conf_no"/>. </prompt>
</block>
As another example, suppose that a platform has a feature that allows the user to enter arbitrary text messages using a telephone keypad.
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form>
  <object name="message" classid="builtin://keypad-text-input">
   <prompt>
    Enter your message by pressing your keypad once per letter.
    For a space, enter star. To end the message, press the pound sign.
   </prompt>
  </object>
  <block>
   <assign name="document.pager_message" expr="message.text"/>
   <goto next="#confirm_pager_message"/>
  </block>
 </form>
</vxml>
The user is first prompted for the pager message, then keys it in. The <block> copies the message to the variable document.pager_message.
Attributes of <object> include:
name | When the object is evaluated, it sets this variable to an ECMAScript value whose type is defined by the object. |
---|---|
expr | The initial value of the form item variable; default is ECMAScript undefined. If initialized to a value, then the form item will not be visited unless the form item variable is cleared. |
cond | An expression that must evaluate to true after conversion to boolean in order for the form item to be visited. |
classid | The URI specifying the location of the object's implementation. The URI conventions are platform-dependent. |
codebase | The base path used to resolve relative URIs specified by classid, data, and archive. It defaults to the base URI of the current document. |
codetype | The content type of data expected when downloading the object specified by classid. When absent it defaults to the value of the type attribute. |
data | The URI specifying the location of the object's data. If it is a relative URI, it is interpreted relative to the codebase attribute. |
type | The content type of the data specified by the data attribute. |
archive | A space-separated list of URIs for archives containing resources relevant to the object, which may include the resources specified by the classid and data attributes. URIs which are relative are interpreted relative to the codebase attribute. |
fetchhint | See Section 6.1. This defaults to the objectfetchhint property. |
fetchtimeout | See Section 6.1. This defaults to the fetchtimeout property. |
maxage | See Section 6.1. This defaults to the objectmaxage property. |
maxstale | See Section 6.1. This defaults to the objectmaxstale property. |
There is no requirement for implementations to provide platform-specific objects, although implementations must handle the <object> element by throwing error.unsupported.objectname if the particular platform-specific object is not supported (note that 'objectname' in error.unsupported.objectname is a fixed string, and is not substituted with the name of the unsupported object). An implementation that does this is considered to support the <object> element.
The object itself is responsible for determining whether parameter names or values it receives are invalid. If so, the <object> element throws an error. The error may be either object-specific or one of the standard errors listed in Section 5.2.6.
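A document intended to run on platforms that may lack a given object can catch this event locally; the sketch below reuses the keypad-text-input classid from the earlier example:

```xml
<object name="message" classid="builtin://keypad-text-input">
  <prompt> Enter your message using the keypad. </prompt>
  <!-- Thrown if this platform does not provide the object -->
  <catch event="error.unsupported.objectname">
    <prompt> Keypad text entry is not available on this platform. </prompt>
    <exit/>
  </catch>
</object>
```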
The <record> element is an input item that collects a recording from the user. A reference to the recorded audio is stored in the input item variable, which can be played back (using the expr attribute on <audio>) or submitted to a server, as shown in this example:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form>
  <property name="bargein" value="true"/>
  <block>
   <prompt> Riley is not available to take your call. </prompt>
  </block>
  <record name="msg" beep="true" maxtime="10s"
    finalsilence="4000ms" dtmfterm="true" type="audio/x-wav">
   <prompt timeout="5s"> Record a message after the beep. </prompt>
   <noinput> I didn't hear anything, please try again. </noinput>
  </record>
  <field name="confirm">
   <grammar type="application/srgs+xml" src="/grammars/boolean.grxml"/>
   <prompt> Your message is <audio expr="msg"/>. </prompt>
   <prompt> To keep it, say yes. To discard it, say no. </prompt>
   <filled>
    <if cond="confirm">
     <submit next="save_message.pl" enctype="multipart/form-data"
       method="post" namelist="msg"/>
    </if>
    <clear/>
   </filled>
  </field>
 </form>
</vxml>
The user is prompted to record a message, and then records it. The recording terminates when one of the following conditions is met: the interval of final silence occurs, a DTMF key is pressed, the maximum recording time is exceeded, or the caller hangs up. The recording is played back, and if the user approves it, is sent on to the server for storage using the HTTP POST method. Notice that like other input items, <record> has grammar, prompt and catch elements. It may also have <filled> actions.
Figure 7: Timing of prompts, audio recording, and DTMF input
When a user hangs up during recording, the recording terminates and a connection.disconnect.hangup event is thrown. However, audio recorded up until the hangup is available through the <record> variable. Applications, such as simple voicemail services, can then return audio data to a server even after disconnection:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form>
  <record name="msg" beep="true" maxtime="10s"
    finalsilence="4000ms" dtmfterm="true" type="audio/x-wav">
   <prompt timeout="5s"> Record a message after the beep. </prompt>
   <noinput> I didn't hear anything, please try again. </noinput>
   <catch event="connection.disconnect.hangup">
    <submit next="./voicemail_server.asp"/>
   </catch>
  </record>
 </form>
</vxml>
A recording begins at the earliest after the playback of any prompts (including the 'beep' tone if defined). As an optimization, a platform may begin recording when the user starts speaking.
A timeout interval is defined to begin immediately after prompt playback (including the 'beep' tone if defined) and its duration is determined by the 'timeout' property. If the timeout interval is exceeded before recording begins, then a <noinput> event is thrown.
A maxtime interval is defined to begin when recording begins and its duration is determined by the 'maxtime' attribute. If the maxtime interval is exceeded before recording ends, then the recording is terminated and the maxtime shadow variable is set to 'true'.
A recording ends when an event is thrown, DTMF or speech input matches an active grammar, or the maxtime interval is exceeded. As an optimization, a platform may end recording after a silence interval (set by the 'finalsilence' attribute) indicating the user has stopped speaking.
If no audio is collected during execution of <record>, then the record variable remains unfilled (note). This can occur, for example, when DTMF or speech input is received during prompt playback or before the timeout interval expires. In particular, if no audio is collected before the user terminates recording with DTMF input matching a local DTMF grammar (or when the dtmfterm attribute is set to true), then the record variable is not filled (so shadow variables are not set), and the FIA applies as normal without a noinput event being thrown. However, information about the input may be available in these situations via application.lastresult$ as described in Section 5.1.5.
The <record> element contains a 'dtmfterm' attribute as a developer convenience. A 'dtmfterm' attribute with the value 'true' is equivalent to the definition of a local DTMF grammar which matches any DTMF input. The dtmfterm attribute has priority over specified local DTMF grammars.
Any DTMF keypress matching an active grammar terminates recording. DTMF keypresses not matching an active grammar are ignored (and therefore do not terminate or otherwise affect recording) and may optionally be removed from the signal by the platform.
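The dtmfterm shorthand is thus roughly equivalent to declaring a local DTMF grammar that matches any single key, as in the following sketch (the inline SRGS grammar content is illustrative; platforms may require an external grammar document):

```xml
<record name="msg" dtmfterm="false" beep="true" type="audio/x-wav">
  <!-- Local DTMF grammar matching any single keypress; a match
       terminates the recording, approximating dtmfterm="true". -->
  <grammar mode="dtmf" version="1.0" root="anykey"
           type="application/srgs+xml">
    <rule id="anykey">
      <one-of>
        <item>0</item> <item>1</item> <item>2</item> <item>3</item>
        <item>4</item> <item>5</item> <item>6</item> <item>7</item>
        <item>8</item> <item>9</item> <item>*</item> <item>#</item>
      </one-of>
    </rule>
  </grammar>
  <prompt> Record a message after the beep. </prompt>
</record>
```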
Platform support for recognition of speech grammars during recording is optional. If the platform supports simultaneous recognition and recording, then spoken input matching an active non-local speech grammar terminates recording and the FIA is invoked, transferring execution to the element containing the grammar. The 'terminating' speech input is accessible via application.lastresult$. The audio of the recognized 'terminating' speech input is not available and is not part of the recording. Note that, unlike DTMF, speech recognition input cannot be used just to terminate recording: if local speech grammars are specified, they are treated as inactive (i.e. they are ignored), even if the platform supports simultaneous recognition and recording.
If the termination grammar matched is a local grammar, the recording is placed in the record variable. Otherwise, the record variable is left unfilled (note) and the form interpretation algorithm is invoked. In each case, application.lastresult$ is assigned.
note: Although the record variable is not filled with a recording in this case, a match of a non-local grammar may nevertheless result in an assignment of some value to the record variable (see Section 3.1.6).
The attributes of <record> are:
name | The input item variable that will hold the recording. Note that how this variable is implemented may vary between platforms (although all platforms must support its behaviour in <audio> and <submit> as described in this specification). |
---|---|
expr | The initial value of the form item variable; default is ECMAScript undefined. If initialized to a value, then the form item will not be visited unless the form item variable is cleared. |
cond | An expression that must evaluate to true after conversion to boolean in order for the form item to be visited. |
modal | If this is true (the default), all non-local speech and DTMF grammars are not active while making the recording. If this is false, non-local speech and DTMF grammars are active. |
beep | If true, a tone is emitted just prior to recording. Defaults to false. |
maxtime | The maximum duration to record. The value is a Time Designation (see Section 6.5). Defaults to a platform-specific value. |
finalsilence | The interval of silence that indicates end of speech. The value is a Time Designation (see Section 6.5). Defaults to a platform-specific value. |
dtmfterm | If true, any DTMF keypress not matched by an active grammar will be treated as a match of an active (anonymous) local DTMF grammar. Defaults to true. |
type | The media format of the resulting recording. Platforms must support the audio file formats specified in Appendix E (other formats may also be supported). Defaults to a platform-specific format which should be one of the required formats. |
The <record> element has the following shadow variables set after the recording has been made:
name$.duration | The duration of the recording in milliseconds. |
---|---|
name$.size | The size of the recording in bytes. |
name$.termchar | If the dtmfterm attribute is true, and the user terminates the recording by pressing a DTMF key, then this shadow variable is the key pressed (e.g. "#"). Otherwise it is undefined. |
name$.maxtime | Boolean, true if the recording was terminated because the maxtime duration was reached. |
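These shadow variables can be consulted in a <filled> action, for example to detect that the recording was truncated (a sketch):

```xml
<record name="msg" beep="true" maxtime="10s" type="audio/x-wav">
  <prompt> Record a message after the beep. </prompt>
  <filled>
    <!-- msg$.maxtime is true if the recording hit the 10 second cap -->
    <if cond="msg$.maxtime">
      <prompt> Your message was cut off at the time limit. </prompt>
    </if>
    <prompt>
      Your message lasted <value expr="msg$.duration"/> milliseconds.
    </prompt>
  </filled>
</record>
```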
The <transfer> element directs the interpreter to connect the caller to another entity (e.g. a telephone line or another voice application). During the transfer operation, the current interpreter session is suspended.
There are a variety of ways an implementation platform can initiate a transfer, including "bridge", "blind", network-based redirect (sometimes referred to as "take back and transfer"), "switchhook transfer", etc. Bridge and blind transfer types are supported; the others are highly dependent upon specific platform and network features and configuration and are therefore outside the scope of this specification.
The <transfer> element is optional, though platforms should support it. Platforms that support <transfer> may support bridge or blind transfer types, or both. Platforms that support either type of transfer may optionally support bargein input modes of DTMF, speech recognition, or both, during the call transfer to drop the far-end. Blind transfer attempts can only be cancelled up to the point the outgoing call begins.
Attributes are:
name | Stores the outcome of a bridge transfer attempt. In the case of a blind transfer, this variable is undefined. |
---|---|
expr | The initial value of the form item variable; default is ECMAScript undefined. If initialized to a value, then the form item will not be visited unless the form item variable is cleared. |
cond | An expression that must evaluate to true in order for the form item to be visited. |
dest | The URI of the destination (telephone, IP telephony address). Platforms must support the tel: URL syntax described in [RFC2806] and may support other URI-based addressing schemes. |
destexpr | An ECMAScript expression yielding the URI of the destination. |
bridge | Determines whether the platform remains in the connection with the caller and callee: a bridge transfer if "true", a blind transfer if "false" (the default). |
connecttimeout | The time to wait while trying to connect the call before returning the noanswer condition. The value is a Time Designation (see Section 6.5). Only applies if bridge is true. Default is platform specific. |
maxtime | The time that the call is allowed to last, or 0s if no limit is imposed. The value is a Time Designation (see Section 6.5). Only applies if bridge is true. Default is 0s. |
transferaudio | The URI of the audio source to play while the transfer attempt is in progress (before far-end answer). If the resource cannot be fetched, the error is ignored and the transfer continues; what the caller hears is platform-dependent. |
aai | Application-to-application information. A string containing data sent to an application on the far-end, available in the session variable session.connection.aai. The transmission of aai data may depend upon signaling network gateways and data translation (e.g. ISDN to SIP); the status of data sent to a remote site is not known or reported. Although all platforms must support the aai attribute, platforms are not required to send aai data and need not support receipt of aai data. Platforms that cannot receive aai data must set the session.connection.aai variable to the ECMAScript undefined value. The underlying transmission mechanism may impose data length limits. |
aaiexpr | An ECMAScript expression yielding the AAI data. |
Exactly one of "dest" or "destexpr" may be specified; otherwise, an error.badfetch event is thrown. Likewise, exactly one of "aai" or "aaiexpr" may be specified; otherwise, an error.badfetch event is thrown.
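Putting these attributes together, a minimal bridge transfer might look like the following sketch (the destination number and outcome handling are illustrative):

```xml
<transfer name="mycall" dest="tel:+1-555-123-4567"
          bridge="true" connecttimeout="20s" maxtime="300s">
  <prompt> Please wait while we connect your call. </prompt>
  <filled>
    <!-- The form item variable holds the outcome of the attempt -->
    <if cond="mycall == 'busy'">
      <prompt> The line is busy. Please try again later. </prompt>
    <elseif cond="mycall == 'noanswer'"/>
      <prompt> There was no answer. </prompt>
    </if>
  </filled>
</transfer>
```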
With a blind transfer, an attempt is made to connect the original caller with the callee. Any prompts preceding the <transfer>, as well as prompts within the <transfer>, are queued and played before the transfer attempt begins; bargein properties apply as normal.
Figure 8: Audio Connections during a blind transfer: <transfer bridge="false">
Any audio source specified by the transferaudio attribute is ignored since no audio can be played from the platform to the caller during the transfer attempt. Whether the connection is successful or not, the implementation platform cannot regain control of the connections.
Connection status is not available. For example, it is not possible to know whether the callee was busy, when a successful call ends, etc. However, some error conditions may be reported if known to the platform, such as if the caller is not authorized to call the destination, or if the destination URI is malformed. These are platform-specific, but should follow the naming convention of other transfer form item variable values.
The caller can cancel the transfer attempt before the outgoing call begins by barging in with a speech or DTMF command that matches an active grammar during the playback of any queued audio.
In this case, the form item variable is set, and the following shadow variables are set:
name$.duration | The duration of a call transfer in seconds. The duration is 0 if a call attempt was terminated by the caller (using a voice or DTMF command) before the outgoing call begins. |
---|---|
name$.inputmode | The input mode of the terminating command (dtmf or voice), or undefined if the transfer was not terminated by a grammar match. |
name$.utterance | The utterance text used if the transfer was terminated by speech recognition input, or the DTMF result if the transfer was terminated by DTMF input; otherwise it is undefined. |
Also, the application.lastresult$ variable will be filled as described in Section 5.1.5.
If the caller disconnects by hanging up during a call transfer attempt before the connection to the callee begins, a connection.disconnect.hangup event will be thrown, and dialog execution will transition to a handler for the hangup event (if one exists). The form item variable, and thus shadow variables, will not be set.
Once the transfer begins and the interpreter disconnects from the session, the platform throws connection.disconnect.transfer and document interpretation continues normally.
Any connection between the caller and callee remains in place regardless of document execution.
Action | Value of form item variable | Event or Error | Reason |
---|---|---|---|
transfer begins | undefined | connection.disconnect.transfer | An attempt has been made to transfer the caller to another line and will not return. |
caller cancels transfer before outgoing call begins | near_end_disconnect | | The caller cancelled the transfer attempt via a DTMF or voice command before the outgoing call begins (during playback of queued audio). |
transfer ends | unknown | | The transfer ended but the reason is not known. |
For a bridge transfer, the platform connects the caller to the callee in a full duplex conversation.
Figure 9: Audio Connections during a bridge transfer: <transfer bridge="true">
Any prompts preceding the <transfer>, as well as prompts within the <transfer>, are queued and played before the transfer attempt begins. The bargein control applies normally. Specification of bargeintype is ignored; "hotword" is set by default.
The caller can cancel the transfer attempt before the outgoing call begins by barging in with a speech or DTMF command that matches an active grammar during the playback of any queued audio.
Platforms may optionally support listening for caller commands to terminate the transfer by specifying one or more grammars inside the <transfer> element. The <transfer> element is modal in that no grammar defined outside its scope is active. The platform will monitor during playing of prompts and during the entire length of the transfer connecting and talking phases:
A successful match will terminate the transfer (the connection to the callee); document interpretation continues normally. An unsuccessful match is ignored. If no grammars are specified, the platform will not listen to input from the caller.
The platform does not monitor in-band signals or voice input from the callee.
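For example, a caller-side voice command to end a bridged call could be supported as in this sketch (the destination and inline grammar content are illustrative; platforms may require an external grammar document):

```xml
<transfer name="call" dest="tel:+1-555-000-0000" bridge="true">
  <!-- Matching "disconnect" during the transfer terminates the
       connection to the callee; the callee's speech is not monitored. -->
  <grammar mode="voice" version="1.0" root="end"
           type="application/srgs+xml">
    <rule id="end"> disconnect </rule>
  </grammar>
  <filled>
    <if cond="call == 'near_end_disconnect'">
      <prompt> The call has been ended. </prompt>
    </if>
  </filled>
</transfer>
```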
While attempting to connect to the callee, the platform monitors call progress indicators (in-band and/or out-of-band, depending upon the particular connection type and protocols). For the duration of a successful transfer, the platform monitors for (out-of-band) telephony events, such as disconnect, on both call legs.
If the callee disconnects, the caller resumes his session with the interpreter. If the caller disconnects, the platform disconnects the callee, and document interpretation continues normally. If both the caller and callee are disconnected by the network, document interpretation continues normally.
The possible outcomes for a bridge transfer before the connection to the callee is established are:
Action | Value of form item variable | Event | Reason |
---|---|---|---|
caller disconnects | | connection.disconnect.hangup | The caller hung up. |
caller disconnects callee | near_end_disconnect | | The caller forced the callee to disconnect via a DTMF or voice command. |
callee busy | busy | | The callee was busy. |
network busy | network_busy | | An intermediate network refused the call. |
callee does not answer | noanswer | | There was no answer within the time specified by the connecttimeout attribute. |
--- | unknown | | The transfer ended but the reason is not known. |
The possible outcomes for a bridge transfer after the connection to the callee is established are:
Action | Value of form item variable | Event | Reason |
---|---|---|---|
caller disconnects | | connection.disconnect.hangup | The caller hung up. |
caller disconnects callee | near_end_disconnect | | The caller forced the callee to disconnect via a DTMF or voice command. |
platform disconnects callee | maxtime_disconnect | | The callee was disconnected by the platform because the call duration reached the value of the maxtime attribute. |
network disconnects callee | network_disconnect | | The network disconnected the callee from the platform. |
callee disconnects | far_end_disconnect | | The callee hung up. |
--- | unknown | | The transfer ended but the reason is not known. |
If the caller disconnects by hanging up (either during a call transfer or a call transfer attempt), the connection to the callee (if one exists) is dropped, a connection.disconnect.hangup event will be thrown, and dialog execution will transition to a handler for the hangup event (if one exists). The form item variable, and thus shadow variables, will not be set.
If execution of <transfer> continues normally, then its form item variable is set, and the following shadow variables will be set:
name$.duration | The duration of a call transfer in seconds. The duration is 0 if a call attempt was terminated by the caller (using a voice or DTMF command) prior to being answered. |
---|---|
name$.inputmode | The input mode of the terminating command (dtmf or voice), or undefined if the transfer was not terminated by a grammar match. |
name$.utterance | The utterance text used if the transfer was terminated by speech recognition input, or the DTMF result if the transfer was terminated by DTMF input; otherwise it is undefined. |
If the transfer was terminated by speech recognition input,then application.lastresult$ is assigned as usual.
During a bridge transfer, it might be desirable to play audio to the caller while the platform attempts to connect to the callee. For example, an advertisement ("Buy Joe's Spicy Shrimp Sauce") or informational message ("Your call is very important to us; please wait while we connect you to the next available agent.") might be provided in place of call progress information (ringing, busy, network announcements, etc.).
At the point the outgoing call begins, audio specified by transferaudio begins playing. Playing of transferaudio terminates when the answer status of the far-end connection is determined. This status isn't always known, since the far-end switch can play audio (such as a special information tone, busy tone, network busy tone, or a recording saying the connection can't be made) without actually "answering" the call.
If a specified audio file's play duration is shorter than the time it takes to connect the far end, the caller may hear silence, platform-specific audio, or call progress information, depending upon the platform.
One of the following events may be thrown during a transfer:
Event | Reason | Transfer |
---|---|---|
connection.disconnect.hangup | The caller hung up. | bridge |
connection.disconnect.transfer | An attempt has been made to transfer the caller to another line and will not return. | blind |
If a transfer attempt could not be made, one of the following errors will be thrown:
Error | Reason | Transfer |
---|---|---|
error.connection.noauthorization | The caller is not allowed to call the destination. | blind and bridge |
error.connection.baddestination | The destination URI is malformed. | blind and bridge |
error.connection.noroute | The platform is not able to place a call to the destination. | bridge |
error.connection.noresource | The platform cannot allocate resources to place the call. | bridge |
error.connection.protocol.nnn | The protocol stack for this connection raised an exception that does not correspond to one of the other error.connection events. | bridge |
error.unsupported.transfer.blind | The platform does not support blind transfer. | blind |
error.unsupported.transfer.bridge | The platform does not support bridge transfer. | bridge |
error.unsupported.uri | The platform does not support the URI format used. The special variable _message (Section 5.2.2) will contain the string "The URI x is not a supported URI format", where x is the URI from the dest or destexpr <transfer> attributes. | blind and bridge |
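These events and errors can be handled with <catch> elements; as a minimal sketch (the destination and prompt wording are illustrative, and event prefix matching lets one handler cover all error.connection subtypes):

```xml
<transfer name="mycall" dest="tel:+1-555-000-0000" bridge="true">
  <catch event="error.unsupported.transfer.bridge">
    <prompt>This platform cannot bridge calls.</prompt>
  </catch>
  <catch event="error.connection">
    <!-- matches error.connection.noauthorization,
         error.connection.baddestination, and the other subtypes -->
    <prompt>The call could not be placed.</prompt>
  </catch>
</transfer>
```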
The following example attempts to perform a bridge transfer of the caller to another party, and to wait for that conversation to terminate. Prompts may be included before or within the <transfer> element. This may be used to inform the caller of what is happening, with a notice such as "Please wait while we transfer your call." The <prompt> within the <block>, and the <prompt> within <transfer>, are queued and played before actually performing the transfer. After the audio queue is flushed, the outgoing call is initiated. By default, the caller is connected to the outgoing telephony channel. The "transferaudio" attribute specifies an audio file to be played to the caller in place of audio from the far end until the far end answers. If the audio source is longer than the connect time, the audio will stop playing immediately upon far-end answer.
Figure 10: Sequence and timing during an example of a bridge transfer
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form>
  <var name="mydur" expr="0"/>
  <block>
   <!-- queued and played before starting the transfer -->
   <prompt>Calling Riley. Please wait.</prompt>
  </block>
  <!-- Play music while attempting to connect to far-end -->
  <!-- "hotword" bargeintype during transferaudio only -->
  <!-- Wait up to 60 seconds for the far end to answer -->
  <transfer name="mycall" dest="tel:+1-555-123-4567"
    transferaudio="music.wav" connecttimeout="60s" bridge="true">
   <!-- queued and played before starting the transfer -->
   <!-- bargein properties apply during this prompt -->
   <prompt>Say cancel to disconnect this call at any time.</prompt>
   <!-- specify an external grammar to listen for "cancel" command -->
   <grammar src="cancel.grxml" type="application/srgs+xml"/>
   <filled>
    <assign name="mydur" expr="mycall$.duration"/>
    <if cond="mycall == 'busy'">
     <prompt>Riley's line is busy. Please call again later.</prompt>
    <elseif cond="mycall == 'noanswer'"/>
     <prompt>Riley can't answer the phone now. Please call again later.</prompt>
    </if>
   </filled>
  </transfer>
  <!-- submit call statistics to server -->
  <block>
   <submit namelist="mycall mydur" next="/cgi-bin/report"/>
  </block>
 </form>
</vxml>
The <filled> element specifies an action to perform when some combination of input items is filled. It may occur in two places: as a child of the <form> element, or as a child of an input item.
As a child of a <form> element, the <filled> element can be used to perform actions that occur when a combination of one or more input items is filled. For example, the following <filled> element does a cross-check to ensure that a starting city field differs from the ending city field:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form>
  <field name="start_city">
   <grammar src="http://www.grammars.example.com/voicexml/city.grxml"
     type="application/srgs+xml"/>
   <prompt>What is the starting city?</prompt>
  </field>
  <field name="end_city">
   <grammar src="http://www.grammars.example.com/voicexml/city.grxml"
     type="application/srgs+xml"/>
   <prompt>What is the ending city?</prompt>
  </field>
  <filled mode="all" namelist="start_city end_city">
   <if cond="start_city == end_city">
    <prompt>You can't fly from and to the same city.</prompt>
    <clear/>
   </if>
  </filled>
 </form>
</vxml>
If the <filled> element appears inside an input item, it specifies an action to perform after that input item is filled in:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form>
  <field name="city">
   <grammar type="application/srgs+xml"
     src="http://www.ship-it.example.com/grammars/served_cities.grxml"/>
   <prompt>What is the city?</prompt>
   <filled>
    <if cond="city == 'Novosibirsk'">
     <prompt>Note, Novosibirsk service ends next year.</prompt>
    </if>
   </filled>
  </field>
 </form>
</vxml>
After each gathering of the user's input, all the input items mentioned in the input are set, and then the interpreter looks at each <filled> element in document order (no preference is given to ones in input items vs. ones in the form). Those whose conditions are matched by the utterance are then executed in order, until there are no more, or until one transfers control or throws an event.
Attributes include:
mode | Either all (the default), or any. If any, this action is executed when any of the specified input items is filled by the last user input. If all, this action is executed when all of the mentioned input items are filled, and at least one has been filled by the last user input. A <filled> element in an input item cannot specify a mode; if a mode is specified, then an error.badfetch is thrown by the platform upon encountering the document. |
---|---|
namelist | The input items to trigger on. For a <filled> in a form, namelist defaults to the names (explicit and implicit) of the form's input items. A <filled> element in an input item cannot specify a namelist (the namelist in this case is the input item name); if a namelist is specified, then an error.badfetch is thrown by the platform upon encountering the document. Note that control items are not permitted in this list; an error.badfetch is thrown when the document contains a <filled> element with a namelist attribute referencing a control item variable. |
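As a sketch of the any mode (the field names and grammar URIs here are illustrative), the following form-level <filled> runs as soon as either field is filled by the last user input:

```xml
<form>
  <field name="day">
    <grammar src="day.grxml" type="application/srgs+xml"/>
    <prompt>What day?</prompt>
  </field>
  <field name="time">
    <grammar src="time.grxml" type="application/srgs+xml"/>
    <prompt>What time?</prompt>
  </field>
  <!-- mode="any": executed when day or time is filled by the last input -->
  <filled mode="any" namelist="day time">
    <prompt>Got it.</prompt>
  </filled>
</form>
```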
A <link> element may have one or more grammars which are scoped to the element containing the <link>. A "scope" attribute on the element containing the <link> has no effect on the scope of the <link> grammars (for example, when a <link> is contained in a <form> with scope="document", the <link> grammars are scoped to the form, not to the document). Grammar elements contained in the <link> are not permitted to specify scope (see Section 3.1.3 for details). When one of these grammars is matched, the link activates, and either:
Transitions to a new document or dialog (like <goto>),or
Throws an event (like <throw>).
For instance, this link activates when you say "books" or press "2".
<link next="http://www.voicexml.org/books/main.vxml">
  <grammar mode="voice" version="1.0" root="root">
    <rule id="root" scope="public">
      <one-of>
        <item>books</item>
        <item>VoiceXML books</item>
      </one-of>
    </rule>
  </grammar>
  <grammar mode="dtmf" version="1.0" root="r2">
    <rule id="r2" scope="public">2</rule>
  </grammar>
</link>
This link takes you to a dynamically determined dialog in the current document:
<link expr="'#' + document.helpstate">
  <grammar mode="voice" version="1.0" root="root">
    <rule id="root" scope="public">help</rule>
  </grammar>
</link>
The <link> element can be a child of <vxml>, <form>, or of the form items <field> and <initial>. A link at the <vxml> level has grammars that are active throughout the document. A link at the <form> level has grammars active while the user is in that form. If an application root document has a document-level link, its grammars are active no matter what document of the application is being executed.
If execution is in a modal form item, then link grammars at the form, document or application level are not active.
You can also define a link that, when matched, throws an event instead of going to a new document. This event is thrown at the current location in the execution, not at the location where the link is specified. For example, if the user matches this link's grammar or enters '2' on the keypad, a help event is thrown in the form item the user was visiting and is handled by the best qualified <catch> in the item's scope (see Section 5.2.4 for further details):
<link dtmf="2" event="help">
  <grammar mode="voice" version="1.0" root="r5">
    <rule id="r5" scope="public">
      <one-of>
        <item>arrgh</item>
        <item>alas all is lost</item>
        <item>fie ye froward machine</item>
        <item>I don't get it</item>
      </one-of>
    </rule>
  </grammar>
</link>
When a link is matched, application.lastresult$ is assigned. This allows callflow decisions to be made downstream based on the actual semantic result. An example appears in Section 5.1.5.
Conceptually, the link element can be thought of as having two parts: condition and action. The "condition" is the content of the link element, i.e. the grammar(s) that must be matched in order for the link to be activated. The "action" is specified by the attributes of the element, i.e. where to transition or which event to throw. The "condition" is resolved/evaluated lexically, while the "action" is resolved/evaluated dynamically. Specifically this means that:
Attributes of <link> are:
next | The URI to go to. This URI is a document (perhaps with an anchor to specify the starting dialog), or a dialog in the current document (just a bare anchor). |
---|---|
expr | Like next, except that the URI is dynamically determined by evaluating the given ECMAScript expression. |
event | The event to throw when the user matches one of the link grammars. |
eventexpr | An ECMAScript expression evaluating to the name of the event to throw when the user matches one of the link grammars. |
message | A message string providing additional context about the event being thrown. The message is available as the value of a variable within the scope of the catch element; see Section 5.2.2. |
messageexpr | An ECMAScript expression evaluating to the message string. |
dtmf | The DTMF sequence for this link. It is equivalent to a simple DTMF <grammar>, and DTMF properties (Section 6.3.3) apply to recognition of the sequence. Unlike DTMF grammars, whitespace is optional: dtmf="123#" is equivalent to dtmf="1 2 3 #". The attribute can be used at the same time as other <grammar>s: the link is activated when user input matches a link grammar or the DTMF sequence. |
fetchaudio | See Section 6.1. This defaults to the fetchaudio property. |
fetchhint | See Section 6.1. This defaults to the documentfetchhint property. |
fetchtimeout | See Section 6.1. This defaults to the fetchtimeout property. |
maxage | See Section 6.1. This defaults to the documentmaxage property. |
maxstale | See Section 6.1. This defaults to the documentmaxstale property. |
Exactly one of "next", "expr", "event" or "eventexpr" must be specified; otherwise, an error.badfetch event is thrown. Exactly one of "message" or "messageexpr" may be specified; otherwise, an error.badfetch event is thrown.
The <grammar> element is used to provide a speech grammar that
specifies a set of utterances that a user may speak to perform an action or supply information, and
for a matching utterance, returns a corresponding semantic interpretation. This may be a simple value (such as a string), a flat set of attribute-value pairs (such as day, month, and year), or a nested object (for a complex request).
The <grammar> element is designed to accommodate any grammar format that meets these two requirements. VoiceXML platforms must support at least one common format, the XML Form of the W3C Speech Recognition Grammar Specification [SRGS]. VoiceXML platforms should support the Augmented BNF (ABNF) Form of the W3C Speech Recognition Grammar Specification [SRGS]. VoiceXML platforms may choose to support grammar formats other than SRGS. For instance, a platform might use the <grammar> element's support for PCDATA to inline a proprietary grammar definition, or use the "src" and "type" attributes for an external one.
VoiceXML platforms must be a Conforming XML Form Grammar Processor as defined in the W3C Speech Recognition Grammar Specification [SRGS]. While this requires a platform to process documents with one or more "xml:lang" attributes defined, it does not require that the platform be multi-lingual. When an unsupported language is encountered, the platform throws an error.unsupported.language event which specifies the unsupported language in its message variable.
The following elements are defined in the XML Form of the W3C Speech Recognition Grammar Specification [SRGS] and are available in VoiceXML 2.0. This document does not redefine these elements. Refer to the W3C Speech Recognition Grammar Specification [SRGS] for definitions and examples.
Element | Purpose | Section (in [SRGS]) |
---|---|---|
<grammar> | Root element of an XML grammar | 4 |
<meta> | Header declaration of meta content of an HTTP equivalent | 4.11.1 |
<metadata> | Header declaration of XML metadata content | 4.11.2 |
<lexicon> | Header declaration of a pronunciation lexicon | 4.10 |
<rule> | Declare a named rule expansion of a grammar | 3 |
<token> | Define a word or other entity that may serve as input | 2.1 |
<ruleref> | Refer to a rule defined locally or externally | 2.2 |
<item> | Define an expansion with optional repeating and probability | 2.3 |
<one-of> | Define a set of alternative rule expansions | 2.4 |
<example> | Element contained within a rule definition that provides an example of input that matches the rule | 3.3 |
<tag> | Define an arbitrary string to be included inline in an expansion which may be used for semantic interpretation | 2.6 |
The <grammar> element may be used to specify an inline grammar or an external grammar. An inline grammar is specified by the content of a <grammar> element and defines an entire grammar:
<grammar type="media-type" mode="voice">inline speech grammar</grammar>
It may be necessary in this case to enclose the content in a CDATA section [XML]. For inline grammars the type parameter specifies a media type that governs the interpretation of the content of the <grammar> element.
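For instance, an inline ABNF grammar containing characters that are reserved in XML (such as "<" or "&") could be wrapped in a CDATA section as follows (a sketch; the rule itself is illustrative):

```xml
<grammar mode="voice" type="application/srgs"><![CDATA[
#ABNF 1.0;
language en-US;
root $yesno;
public $yesno = yes | no;
]]></grammar>
```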
The following is an example of an inline grammar defined by the XML Form of the W3C Speech Recognition Grammar Specification [SRGS].
<grammar mode="voice" xml:lang="en-US" version="1.0" root="command">
  <!-- Command is an action on an object -->
  <!-- e.g. "open a window" -->
  <rule id="command" scope="public">
    <ruleref uri="#action"/>
    <ruleref uri="#object"/>
  </rule>
  <rule id="action">
    <one-of>
      <item> open </item>
      <item> close </item>
      <item> delete </item>
      <item> move </item>
    </one-of>
  </rule>
  <rule id="object">
    <item repeat="0-1">
      <one-of>
        <item> the </item>
        <item> a </item>
      </one-of>
    </item>
    <one-of>
      <item> window </item>
      <item> file </item>
      <item> menu </item>
    </one-of>
  </rule>
</grammar>
The following is the equivalent example of the inline grammar defined by the ABNF Form of the W3C Speech Recognition Grammar Specification [SRGS]. Because VoiceXML platforms are not required to support this format, it may be less portable.
<grammar mode="voice" type="application/srgs">
  #ABNF 1.0;
  language en-US;
  mode voice;
  root $command;
  public $command = $action $object;
  $action = open | close | delete | move;
  $object = [the | a] (window | file | menu);
</grammar>
An external grammar is specified by an element of the form
<grammar src="URI" type="media-type"/>
The media type is optional in this case because the interpreter context will attempt to determine the type dynamically as described in Section 3.1.1.4.
If the src attribute is defined and there is an inline grammar as content of a grammar element, then an error.badfetch event is thrown.
The following is an example of a reference to an external grammar written in the XML Form of the W3C Speech Recognition Grammar Specification [SRGS].
<grammar type="application/srgs+xml" src="http://www.grammar.example.com/date.grxml"/>
The following example is the equivalent grammar reference for a grammar that is authored using the ABNF Form of the W3C Speech Recognition Grammar Specification [SRGS].
<grammar type="application/srgs" src="http://www.grammar.example.com/date.gram"/>
A weight for the grammar can be specified by the weight attribute:
<grammar weight="0.6" src="form.grxml" type="application/srgs+xml"/>
Grammar elements, including those in link, field and form elements, can have a weight attribute. The grammar can be inline, external or built-in.
Weights follow the definition of weights on alternatives in the W3C Speech Recognition Grammar Specification [SRGS §2.4.1]. A weight is a simple positive floating point value without exponentials. Legal formats are "n", "n.", ".n" and "n.n", where "n" is a sequence of one or more digits.
A weight is nominally a multiplying factor in the likelihood domain of a speech recognition search. A weight of "1.0" is equivalent to providing no weight at all. A weight greater than "1.0" positively biases the grammar, and a weight less than "1.0" negatively biases the grammar. If unspecified, the default weight for any grammar is "1.0". If no weight is specified for any grammar element, then all grammars are equally likely.
<link event="help">
  <grammar weight="0.5" mode="voice" version="1.0" root="help">
    <rule id="help" scope="public">
      <item repeat="0-1">Please</item>
      help
    </rule>
  </grammar>
</link>
<form>
  <grammar src="form.grxml" type="application/srgs+xml"/>
  <field name="expireDate">
    <grammar weight="1.2" src="http://www.example.org/grammar/date"/>
  </field>
</form>
In the example above, the semantics of weights is equivalent to the following XML grammar.
<grammar root="r1" type="application/srgs+xml">
  <rule id="r1">
    <one-of>
      <item weight="0.5"> <ruleref uri="#help"/> </item>
      <item weight="1.0"> <ruleref uri="form.grxml"/> </item>
      <item weight="1.2"> <ruleref uri="http://www.example.org/grammar/date"/> </item>
    </one-of>
  </rule>
  <rule id="help">
    <item repeat="0-1">Please</item>
    help
  </rule>
</grammar>
Implicit grammars, such as those in options, do not support weights; use the <grammar> element instead for control over grammar weight.
Grammar weights only affect grammar processing. They do not directly affect the post-processing of grammar results, including grammar precedence when user input matches multiple active grammars (see Section 3.1.4).
A weight has no effect on DTMF grammars (see Section 3.1.2). Any weight attribute specified in a grammar element whose mode attribute is dtmf is ignored.
<!-- weight will be ignored -->
<grammar mode="dtmf" weight="0.3" src="http://www.example.org/dtmf/number"/>
Appropriate weights are difficult to determine, and guessing weights does not always improve recognition performance. Effective weights are usually obtained by studying real speech and textual data on a particular platform. Furthermore, a grammar weight is platform-specific: different ASR engines may treat the same weight value differently, so a weight value that works well on one platform may generate different results on other platforms.
Attributes of <grammar> inherited from the W3C Speech Recognition Grammar Specification [SRGS] are:
version | Defines the version of the grammar. |
---|---|
xml:lang | The language identifier of the grammar (for example, "fr-CA" for Canadian French). If omitted, the value is inherited down from the document hierarchy. |
mode | Defines the mode of the grammar following the modes of the W3C Speech Recognition Grammar Specification [SRGS]. |
root | Defines the rule which acts as the root rule of the grammar. |
tag-format | Defines the tag content format for all tags within the grammar. |
xml:base | Declares the base URI from which relative URIs in the grammar are resolved. This base declaration has precedence over the <vxml> base URI declaration. If a local declaration is omitted, the value is inherited down the document hierarchy. |
The use and interpretation of these attributes is determined as follows:
Attributes of <grammar> added by VoiceXML 2.0 are:
src | The URI specifying the location of the grammar and optionally a rulename within that grammar, if it is external. The URI is interpreted as a rule reference as defined in Section 2.2 of the Speech Recognition Grammar Specification [SRGS], but not all forms of rule reference are permitted from within VoiceXML. The rule reference capabilities are described in detail below this table. |
---|---|
scope | Either "document", which makes the grammar active in all dialogs of the current document (and relevant application leaf documents), or "dialog", to make the grammar active throughout the current form. If omitted, the grammar scoping is resolved by looking at the parent element. See Section 3.1.3 for details on scoping, including precedence behavior. |
type | The preferred media type of the grammar. A resource indicated by the URI reference in the src attribute may be available in one or more media types. The author may specify the preferred media type via the type attribute. When the content represented by a URI is available in many data formats, a VoiceXML platform may use the preferred media type to influence which of the multiple formats is used. For instance, on a server implementing HTTP content negotiation, the processor may use the preferred media type to order the preferences in the negotiation. The resource representation delivered by dereferencing the URI reference may be considered in terms of two types. The declared media type is the asserted value for the resource, and the actual media type is the true format of its content. The actual media type should be the same as the declared media type, but this is not always the case (e.g. a misconfigured HTTP server might return 'text/plain' for an 'application/srgs+xml' document). A specific URI scheme may require that the resource owner always, sometimes, or never return a media type. The declared media type is the value returned by the resource owner or, if none is returned, the preferred media type. There may be no declared media type if the resource owner does not return a value and no preferred type is specified. Whenever specified, the declared media type is authoritative. Three special cases may arise. The declared media type may not be supported by the processor; in this case, an error.unsupported.format is thrown by the platform. The declared media type may be supported but the actual media type may not match; an error.badfetch is thrown by the platform. Finally, there may be no declared media type; the behavior depends on the specific URI scheme and the capabilities of the grammar processor. For instance, HTTP 1.1 allows document introspection (see [RFC2616], section 7.2.1), the data scheme falls back to a default media type, and local file access defines no guidelines. The following table provides some informative examples. The tentative media types for the W3C grammar format are "application/srgs+xml" for the XML form and "application/srgs" for ABNF grammars. |
weight | Specifies the weight of the grammar. See Section 3.1.1.3. |
fetchhint | See Section 6.1. This defaults to the grammarfetchhint property. |
fetchtimeout | See Section 6.1. This defaults to the fetchtimeout property. |
maxage | See Section 6.1. This defaults to the grammarmaxage property. |
maxstale | See Section 6.1. This defaults to the grammarmaxstale property. |
Either a "src" attribute or an inline grammar (but not both) must be specified; otherwise, an error.badfetch event is thrown.
The <grammar> element is also extended in VoiceXML 2.0 to allow PCDATA for inline grammar formats besides the XML Form of the W3C Speech Recognition Grammar Specification [SRGS].
When referencing an external grammar, the value of the src attribute is a URI specifying the location of the grammar, with an optional fragment for the rulename. Section 2.2 of the Speech Recognition Grammar Specification [SRGS] defines several forms of rule reference. The following are the forms that are permitted on a grammar element in VoiceXML:
The following are the forms of rule reference defined by [SRGS] that are not supported in VoiceXML 2.0.
The <grammar> element can be used to provide a DTMF grammar that
specifies a set of key presses that a user may use to perform an action or supply information, and
To advance application portability, VoiceXML platforms are required to support the DTMF grammar XML format defined in Appendix D of [SRGS].
A DTMF grammar is distinguished from a speech grammar by the mode attribute on the <grammar> element. An "xml:lang" attribute has no effect on DTMF grammar handling. In other respects speech and DTMF grammars are handled identically, including the ability to define the grammar inline or by an external grammar reference. The media type handling, scoping and fetching are also identical.
The following is an example of a simple inline XML DTMF grammar that accepts as input either "1 2 3" or "#".
<grammar mode="dtmf" version="1.0" root="root">
  <rule id="root" scope="public">
    <one-of>
      <item> 1 2 3 </item>
      <item> # </item>
    </one-of>
  </rule>
</grammar>
Input item grammars are always scoped to the containing input item; that is, they are active only when the containing input item was chosen during the select phase of the FIA. Grammars contained in input items cannot specify a scope; if they do, an error.badfetch is thrown.
Link grammars are given the scope of the element that contains the link. Thus, if they are defined in the application root document, links are also active in any other loaded application document. Grammars contained in links cannot specify a scope; if they do, an error.badfetch is thrown.
Form grammars are by default given dialog scope, so that they are active only when the user is in the form. If they are given scope document, they are active whenever the user is in the document. If they are given scope document and the document is the application root document, then they are also active whenever the user is in another loaded document in the same application. A grammar in a form may be given document scope either by specifying the scope attribute on the form element or by specifying the scope attribute on the <grammar> element. If both are specified, the grammar assumes the scope specified by the <grammar> element.
Menu grammars are also by default given dialog scope, and are active only when the user is in the menu. But they can be given document scope and be active throughout the document, and if their document is the application root document, also be active in any other loaded document belonging to the application. Grammars contained in menu choices cannot specify a scope; if they do, an error.badfetch is thrown.
Sometimes a form may need to have some grammars active throughout the document, and other grammars that should be active only when in the form. One reason for doing this is to minimize grammar overlap problems. To do this, each individual <grammar> element can be given its own scope if that scope should be different than the scope of the <form> element itself:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form scope="document">
  <grammar type="application/srgs">
    #ABNF 1.0;
    language en-gb;
    mode voice;
    root $command;
    public $command = one | two | three;
  </grammar>
  <grammar type="application/srgs" scope="dialog">
    #ABNF 1.0;
    language en-gb;
    mode voice;
    root $command2;
    public $command2 = four | five | six;
  </grammar>
 </form>
</vxml>
When the interpreter waits for input as a result of visiting an input item, the following grammars are active:
grammars for that input item, including grammars contained in links in that input item;
grammars for its form, including grammars contained in links in that form;
grammars contained in links in its document, and grammars for menus and other forms in its document which are given document scope;
grammars contained in links in its application root document, and grammars for menus and forms in its application root document which are given document scope;
grammars defined by platform default event handlers, such as help, exit and cancel.
In the case that an input matches more than one active grammar, the list above defines the precedence order. If the input matches more than one active grammar with the same precedence, the precedence is determined using document order: the first grammar in document order has highest priority. If no grammars are active when an input is expected, the platform must throw an error.semantic event. The error will be thrown in the context of the executing element. Menus behave with regard to grammar activation like their equivalent forms (see Section 2.2.1).
If the form item is modal (i.e., its modal attribute is set to true), all grammars except its own are turned off while waiting for input. If the input matches a grammar in a form or menu other than the current form or menu, control passes to the other form or menu. If the match causes control to leave the current form, all current form data is lost.
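As a sketch (the field name and inline DTMF grammar are illustrative), a modal field that must collect a digit string without interference from document-level links or other form grammars can be written as:

```xml
<field name="pin" modal="true">
  <!-- modal="true": only this field's own grammars are active here,
       so document links such as a "help" link cannot be matched -->
  <grammar mode="dtmf" version="1.0" root="digits">
    <rule id="digits" scope="public">
      <item repeat="4">
        <one-of>
          <item>0</item><item>1</item><item>2</item><item>3</item><item>4</item>
          <item>5</item><item>6</item><item>7</item><item>8</item><item>9</item>
        </one-of>
      </item>
    </rule>
  </grammar>
  <prompt>Please enter your four digit PIN.</prompt>
</field>
```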
Grammar activation is not affected by the inputmodes property. For instance, if the inputmodes property restricts input to just voice, DTMF grammars will still be activated, but cannot be matched.
The Speech Recognition Grammar Specification defines a tag element which contains content for semantic interpretation of speech and DTMF grammars (see Section 2.6 of [SRGS]).
The Semantic Interpretation for Speech Recognition specification [SISR] describes a syntax and semantics for tags, and specifies how a semantic interpretation for user input can be computed using the content of tags associated with the matched tokens and rules. The semantic interpretation may be mapped into VoiceXML as described in Section 3.1.6.
The semantic interpretation returned from a Speech Recognition Grammar Specification [SRGS] grammar must be mapped into one or more VoiceXML ECMAScript variables. The process by which this occurs differs slightly for form- and field-level results; these differences will be explored in the next sections. The format of the semantic interpretation, using either the proposed Natural Language Semantics Markup Language [NLSML] or the ECMAScript-like output format of [SISR], has no impact on this discussion. For the purposes of this discussion, the actual result returned from the recognizer is assumed to have been mapped into an ECMAScript-like format which is identical to the representation in application.lastresult$.interpretation as discussed in Section 5.1.5.
It is possible that a grammar will match but not return a semantic interpretation. In this case, the platform will use the raw text string for the utterance as the semantic result. Otherwise, this case is handled exactly as if the semantic interpretation consisted of a simple value.
Every input item has an associated slot name which may be used to extract part of the full semantic interpretation. The slot name is the value of the 'slot' attribute, if present (only possible for <field> elements), or else the value of the 'name' attribute (for <field>s without a slot attribute, and for other input items as well). If neither slot nor name is present, then the slot name is undefined.
The slot name is used during the Process Phase of theFIA to determine whether or notan input item matches. A match occurs when either the slot nameis the same as a top-level property or a slot name is used toselect a sub-property. A property having an undefined value (i.e.ECMAScript undefined) will not match. Likewise, slot names whichare undefined will never match. Examples are given inSection 3.1.6.3. Note that itis possible for a specific slot value to fill more than one inputitem if the slot names of the input items are the same.
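For instance (the grammar URI and names here are illustrative, not taken from the specification), a field can use the 'slot' attribute to be filled from a sub-property of the interpretation while keeping a different input item variable name:

```xml
<!-- hypothetical example: the grammar is assumed to return an
     interpretation containing a pizza.number sub-property -->
<field name="pizzacount" slot="pizza.number">
  <grammar src="http://server.example.com/pizza.grxml"/>
  <prompt>How many pizzas would you like?</prompt>
</field>
```

Here the input item variable is 'pizzacount', but the slot name used for matching during the Process Phase is 'pizza.number'.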
The next sections concern mapping form-level and field-level results. There is also a brief discussion of other issues such as the NL Semantics to ECMAScript mapping, transitioning information from ASR results to VoiceXML, and dealing with mismatches between the interpretation result and the VoiceXML form.
Grammars specified at the form level produce a form-level result which may fill multiple input items simultaneously. This may occur anytime, whether in an <initial> element or in an input item, that the user's input matches an active form-level grammar.
Consider the interpretation result from the sentence "I would like a coca cola and three large pizzas with pepperoni and mushrooms." The semantic interpretation may be copied into application.lastresult$.interpretation as
{
  drink: "coke",
  pizza: {
    number: "3",
    size: "large",
    topping: [ "pepperoni", "mushrooms" ]
  }
}
The following table illustrates how this result from a form-level grammar would be assigned to various input items within the form. Note that all input items that can be filled in from the interpretation are filled in simultaneously. The existing values of matching input item variables will be overwritten, and these items will be marked for <filled> processing during the FIA's Process Phase as described in Section 2.4 and Appendix C.
VoiceXML field | Assigned ECMAScript value | Explanation |
---|---|---|
1. <field name="drink"/> -- or -- <object name="drink"/> | "coke" | By default an input item is assigned the top-level result property whose name matches the input item name. |
2. <field name="..." slot="drink"/> | "coke" | If specified for a field, the slot name overrides the field name for selecting the result property. |
3. <field name="pizza"/> -- or -- <object name="pizza"/> | {number: "3", size: "large", topping: ["pepperoni", "mushrooms"]} | The input item name or slot may select a property that is a non-scalar ECMAScript variable in the same way that a scalar value is selected in the previous example. However the application must then handle inspecting the components of the object. This does not take advantage of the VoiceXML form-filling algorithm, in that missing slots in the result would not be automatically prompted for. This may be sufficient in situations where the server is prepared to deal with a structured object. Otherwise, an application may prefer to use the method described in the next example. |
4. <field name="..." slot="pizza.number"/> <field name="..." slot="pizza.size"/> | "3" "large" | The slot may be used to select a sub-property of the result. This approach distributes the result among a number of fields. |
5. <field name="..." slot="pizza.topping"/> | ["pepperoni", "mushrooms"] | The selected property may be a compound object. |
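As a sketch tying rows 1, 4, and 5 together (the grammar URI and prompt wording are illustrative), a form whose fields are all filled by the single pizza utterance might look like:

```xml
<form>
  <!-- form-level grammar assumed to return the { drink, pizza } result above -->
  <grammar src="http://server.example.com/order.grxml"/>
  <initial>
    <prompt>What would you like to order?</prompt>
  </initial>
  <field name="drink">
    <prompt>What would you like to drink?</prompt>
  </field>
  <field name="pizzacount" slot="pizza.number">
    <prompt>How many pizzas?</prompt>
  </field>
  <field name="pizzasize" slot="pizza.size">
    <prompt>What size of pizza?</prompt>
  </field>
  <field name="toppings" slot="pizza.topping">
    <prompt>Which toppings?</prompt>
  </field>
</form>
```

One matching utterance fills all four fields simultaneously; any field left unfilled would be prompted for on a later iteration of the FIA.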
The <field ... slot="pizza.foo"> examples above can be explained by rules that are compatible with and are straightforward extensions of the VoiceXML 1.0 "name" and "slot" attributes:
Grammars specified within an input item produce a field-level result which may fill only the particular input item in which they are contained. These grammars are active only when the FIA is visiting that specific input item. This is useful, for instance, in directed dialogs where a user is prompted individually for each input item.
A field-level result fills the associated input item in the following manner:
This process allows an input item to extract a particular property from the semantic interpretation. This may be combined with <filled> to achieve even greater control.
<field name="getdate">
  <prompt>On what date would you like to fly?</prompt>
  <grammar src="http://server.example.com/date.grxml"/>
  <!-- this grammar always returns an object containing
       string values for the properties day, month, and year -->
  <filled>
    <assign name="getdate.datestring"
            expr="getdate.year + getdate.month + getdate.day"/>
  </filled>
</field>
A matching slot name allows an input item to extract part of a semantic interpretation. Consider this modified result from the earlier pizza example.
application.lastresult$.interpretation =
{
  drink: { size: 'large', liquid: 'coke' },
  pizza: { number: '3', size: 'large', topping: ['pepperoni', 'mushroom'] },
  sidedish: undefined
}
The table below revisits the definition of when the slot name matches a property in the result.
slot name | match or not? |
undefined | does not match |
drink | matches; top level property |
pizza | matches; top level property |
sidedish | does not match; no defined value |
size | does not match; not a top-level property |
pizza.size | matches; sub-property |
pizza.liquid | does not match |
It is also possible to compare the behaviors of form-level and field-level results. For this purpose, consider the following document:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
  <form>
    <grammar src="formlevel.grxml"/>
    <initial> Say something. </initial>
    <field name="x">
      <grammar src="fieldx.grxml"/>
    </field>
    <field name="z" slot="y">
      <grammar src="fieldz.grxml"/>
    </field>
  </form>
</vxml>
This defines two input item variables, 'x' and 'z'. The corresponding slot names are 'x' and 'y' respectively. The next table describes the assignment of these variables depending on which grammar is recognized and what semantic result is returned. The shorthand valueX is used to indicate 'the structured object or simple result value associated with the property x'.
application.lastresult$.interpretation | form-level result (formlevel.grxml) | field-level result in field x | field-level result in field z |
= 'hello' | no assignment; cycle FIA | x = 'hello' | z = 'hello' |
= { x: valueX } | x = valueX | x = valueX | z = { x: valueX } |
= { y: valueY } | z = valueY | x = { y: valueY } | z = valueY |
= { z: valueZ } | no assignment; cycle FIA | x = { z: valueZ } | z = { z: valueZ } |
= { x: valueX, y: valueY, z: valueZ } | x = valueX z = valueY | x = valueX | z = valueY |
= { a: valueA, b: valueB } | no assignment; cycle FIA | x = { a: valueA, b: valueB } | z = { a: valueA, b: valueB } |
At the form level, simple results like the string 'hello' cannot match any input items; structured objects assign all input item variables with matching slot names. At the field level, simple results are always assigned to the input item variable; structured objects will extract the matching property, if it exists, or will otherwise be assigned the entire semantic result.
1. Mapping from NL semantics to ECMAScript: If the NL Semantics Markup Language ([NLSML]) is used, a mapping needs to be defined from the NLSML representation to ECMAScript objects. Since both types of representation have similar nested structures, this mapping is fairly straightforward. This mapping is discussed in detail in the NL Semantics specification.
2. Transitioning semantic results from ASR to VoiceXML: The result of processing the semantic tags of a W3C ASR grammar is the value of the attribute of the root rule when all semantic attachment evaluations have been completed. In addition, the root rule (like all non-terminals) has an associated "text" variable which contains the series of tokens in the utterance that is governed by that non-terminal. In the process of making ASR results available to VoiceXML documents, the VoiceXML platform is not only responsible for filling in the VoiceXML fields based on the value of the attribute of the root rule, as described above, but also for filling in the shadow variables of the field. The name$.utterance shadow variable of the field should be the same as the "text" variable value for the ASR root rule. The platform is also responsible for instantiating the value of the shadow variable "name$.confidence" based on information supplied by the ASR platform, as well as the value of "name$.inputmode" based on whether DTMF or speech was processed. Finally, the platform is responsible for making this same information available in the "application.lastresult$" variable, defined in Section 5.1.5 (specifically, "application.lastresult$.utterance", "application.lastresult$.inputmode", and "application.lastresult$.interpretation"), with the exception of application.lastresult$.confidence, which the platform sets to the confidence of the entire utterance interpretation.
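A sketch of how a dialog might consult these shadow variables (the field name, grammar URI, and confidence threshold are illustrative):

```xml
<field name="city">
  <grammar src="http://server.example.com/city.grxml"/>
  <prompt>Which city?</prompt>
  <filled>
    <!-- city$.confidence, city$.utterance, and city$.inputmode are
         the shadow variables instantiated by the platform -->
    <if cond="city$.confidence &lt; 0.5">
      <prompt>I heard <value expr="city$.utterance"/>,
        but I am not sure. Please say the city again.</prompt>
      <clear namelist="city"/>
    </if>
  </filled>
</field>
```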
3. Mismatches between semantic results and VoiceXML fields: Mapping semantic results to VoiceXML depends on a tight coordination between the ASR grammar and the VoiceXML markup. Since in the current framework there's nothing that enforces consistency between a grammar and the associated VoiceXML dialog, mismatches can occur due to developer oversight. Since the dialog's behaviour during these mismatches is difficult to distinguish from certain normal situations, verifying consistency of information is extremely important. Some examples of mismatches:
In order to address these potential problems, the committee is looking at various approaches to ensuring consistency between the grammar and the VoiceXML.
The <prompt> element controls the output of synthesized speech and prerecorded audio. Conceptually, prompts are instantaneously queued for play, so interpretation proceeds until the user needs to provide an input. At this point, the prompts are played, and the system waits for user input. Once the input is received from the speech recognition subsystem (or the DTMF recognizer), interpretation proceeds.
The <prompt> element has the following attributes:
bargein | Control whether a user can interrupt a prompt. This defaults to the value of the bargein property. See Section 6.3.4. |
---|---|
bargeintype | Sets the type of bargein to be 'speech' or 'hotword'. This defaults to the value of the bargeintype property. See Section 6.3.4. |
cond | An expression that must evaluate to true after conversion to boolean in order for the prompt to be played. Default is true. |
count | A number that allows you to emit different prompts if the user is doing something repeatedly. If omitted, it defaults to "1". |
timeout | The timeout that will be used for the following user input. The value is a Time Designation (see Section 6.5). The default noinput timeout is platform specific. |
xml:lang | The language identifier for the prompt. If omitted, it defaults to the value specified in the document's "xml:lang" attribute. |
xml:base | Declares the base URI from which relative URIs in the prompt are resolved. This base declaration has precedence over the <vxml> base URI declaration. If a local declaration is omitted, the value is inherited down the document hierarchy. |
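Several of these attributes may be combined on one prompt. For example (the wording, the timeout value, and the 'attempts' variable are illustrative, not from the specification):

```xml
<prompt count="2" timeout="10s" bargein="false"
        xml:lang="en-US" cond="attempts &lt; 5">
  Please say only the name of the city you are calling from.
</prompt>
```

This prompt is played on the second and later attempts, only while the (hypothetical) attempts variable is below five; it cannot be interrupted, and allows ten seconds of silence before a noinput event is thrown.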
The content of the <prompt> element is modelled on the W3C Speech Synthesis Markup Language 1.0 [SSML].
The following speech markup elements are defined in [SSML] and are available in VoiceXML 2.0. Refer to the W3C Speech Synthesis Markup Language 1.0 [SSML] for definitions and examples.
Element | Purpose | Section (in SSML spec) |
---|---|---|
<audio> | Specifies audio files to be played and text to be spoken. | 3.3.1 |
<break> | Specifies a pause in the speech output. | 3.2.3 |
<desc> | Provides a description of a non-speech audio source in <audio>. | 3.3.3 |
<emphasis> | Specifies that the enclosed text should be spoken with emphasis. | 3.2.2 |
<lexicon> | Specifies a pronunciation lexicon for the prompt. | 3.1.4 |
<mark> | Ignored by VoiceXML platforms. | 3.3.2 |
<meta> | Specifies meta and "http-equiv" properties for the prompt. | 3.1.5 |
<metadata> | Specifies XML metadata content for the prompt. | 3.1.6 |
<p> | Identifies the enclosed text as a paragraph, containing zero or more sentences. | 3.1.7 |
<phoneme> | Specifies a phonetic pronunciation for the contained text. | 3.1.9 |
<prosody> | Specifies prosodic information for the enclosed text. | 3.2.4 |
<say-as> | Specifies the type of text construct contained within the element. | 3.1.8 |
<s> | Identifies the enclosed text as a sentence. | 3.1.7 |
<sub> | Specifies replacement spoken text for the contained text. | 3.1.10 |
<voice> | Specifies voice characteristics for the spoken text. | 3.2.1 |
When used in VoiceXML, additional properties are defined for the <audio> (Section 4.1.3) and <say-as> (Appendix P) elements. VoiceXML also allows <enumerate> and <value> elements to appear within the <prompt> element.
The VoiceXML platform must be a Conforming Speech Synthesis Markup Language Processor as defined in [SSML]. While this requires a platform to process documents with one or more "xml:lang" attributes defined, it does not require that the platform must be multi-lingual. When an unsupported language is encountered, the platform throws an error.unsupported.language event which specifies the language in its message variable.
You've seen prompts in the previous examples:
<prompt>Please say your city.</prompt>
You can leave out the <prompt> ... </prompt> if:
There is no need to specify a prompt attribute (like bargein), and
The prompt consists entirely of PCDATA (contains no speech markups) or consists of just an <audio> or <value> element.
For instance, these are also prompts:
Please say your city.
<audio src="say_your_city.wav"/>
But in this example, the enclosing prompt elements are required due to the embedded speech markups:
<prompt>Please <emphasis>say</emphasis> your city.</prompt>
When prompt content is specified without an explicit <prompt> element, then the prompt attributes are defined as specified in Table 34.
Prompts can consist of any combination of prerecorded files, audio streams, or synthesized speech:
<prompt>
  Welcome to the Bird Seed Emporium.
  <audio src="rtsp://www.birdsounds.example.com/thrush.wav"/>
  We have 250 kilogram drums of thistle seed for
  <say-as interpret-as="currency">$299.95</say-as>
  plus shipping and handling this month.
  <audio src="http://www.birdsounds.example.com/mourningdove.wav"/>
</prompt>
Audio can be played in any prompt. The audio content can be specified via a URI, and in VoiceXML it can also be in an audio variable previously recorded:
<prompt>
  Your recorded greeting is
  <audio expr="greeting"/>
  To rerecord, press 1.
  To keep it, press pound.
  To return to the main menu press star M.
  To exit press star, star X.
</prompt>
The <audio> element can have alternate content in case the audio sample is not available:
<prompt>
  <audio src="welcome.wav">
    <emphasis>Welcome</emphasis> to the Voice Portal.
  </audio>
</prompt>
If the audio file cannot be played (e.g. 'src' referencing or 'expr' evaluating to an invalid URI, a file with an unsupported format, etc.), the content of the audio element is played instead. The content may include text, speech markup, or another audio element. If the audio file cannot be played and the content of the audio element is empty, no audio is played and no error event is thrown.
If <audio> contains an 'expr' attribute evaluating to ECMAScript undefined, then the element, including its alternate content, is ignored. This allows a developer to specify <audio> elements with dynamically assigned content which, if the element is not required, can be ignored by assigning its 'expr' a null value. For example, the following code shows how this could be used to play back a hand of cards using concatenated audio clips:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
<form>
  <!-- script contains the function sayCard(type,position)
       which takes as input the type of card description (audio or text)
       and its position in an array, and returns the selected card
       description in the specified array position; if there is no
       description in the requested array position, then returns
       ECMAScript undefined -->
  <script src="cardgame.js"/>
  <field name="takecard">
    <grammar type="application/srgs+xml" src="/grammars/boolean.grxml"/>
    <prompt>
      <audio src="you_have.wav">You have the following cards: </audio>
      <!-- maximum hand of 5 cards is described -->
      <audio expr="sayCard(audio,1)"><value expr="sayCard(text,1)"/></audio>
      <audio expr="sayCard(audio,2)"><value expr="sayCard(text,2)"/></audio>
      <audio expr="sayCard(audio,3)"><value expr="sayCard(text,3)"/></audio>
      <audio expr="sayCard(audio,4)"><value expr="sayCard(text,4)"/></audio>
      <audio expr="sayCard(audio,5)"><value expr="sayCard(text,5)"/></audio>
      <audio src="another.wav">Would you like another card?</audio>
    </prompt>
    <filled>
      <if cond="takecard">
        <script>takeAnotherCard()</script>
        <clear/>
      <else/>
        <goto next="./make_bid.vxml"/>
      </if>
    </filled>
  </field>
</form>
</vxml>
Attributes of <audio> defined in[SSML] are:
src | The URI of the audio prompt. See Appendix E for required audio file formats; additional formats may be used if supported by the platform. |
---|
Attributes of <audio> defined only in VoiceXML are:
fetchtimeout | See Section 6.1. This defaults to the fetchtimeout property. |
---|---|
fetchhint | See Section 6.1. This defaults to the audiofetchhint property. |
maxage | See Section 6.1. This defaults to the audiomaxage property. |
maxstale | See Section 6.1. This defaults to the audiomaxstale property. |
expr | An ECMAScript expression which determines the source of the audio to be played. The expression may be either a reference to audio previously recorded with the <record/> item or evaluate to the URI of an audio resource to fetch. |
Exactly one of "src" or "expr" must be specified; otherwise, an error.badfetch event is thrown.
Note that it is a platform optimization to stream audio: i.e. the platform may begin processing audio content as it arrives and not wait for full retrieval. The "prefetch" fetchhint can be used to request full audio retrieval prior to playback.
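For example, an application that wants an entire clip retrieved before playback begins, rather than streamed, might write (file name and timeout value illustrative):

```xml
<prompt>
  <audio src="legal_notice.wav" fetchhint="prefetch" fetchtimeout="10s">
    The legal notice is currently unavailable.
  </audio>
</prompt>
```

If the audio cannot be fetched within the timeout, the alternate content of the <audio> element is played instead.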
The <value> element is used to insert the value of an expression into a prompt. It has one attribute:
expr | The expression to render. |
---|
For example, if n is 12, the prompt
<prompt>
  <value expr="n*n"/> is the square of <value expr="n"/>.
</prompt>
will result in the text string "144 is the square of 12" being passed to the speech synthesis engine.
The manner in which the value attribute is played is controlled by the surrounding speech synthesis markup. For instance, a value can be played as a date in the following example:
<var name="date" expr="'2000/1/20'"/>
<prompt>
  <say-as interpret-as="date">
    <value expr="date"/>
  </say-as>
</prompt>
The text inserted by the <value> element is not subject to any special interpretation; in particular, it is not parsed as an [SSML] document or document fragment. XML special characters (&, >, and <) are not treated specially and do not need to be escaped. The equivalent effect may be obtained by literally inserting the text computed by the <value> element in a CDATA section. For example, when the following variable assignment:
<script>
  <![CDATA[
    e1 = 'AT&T';
  ]]>
</script>
is referenced in a prompt element as
<prompt> The price of <value expr="e1"/> is $1. </prompt>
the following output is produced.
The price of AT&T is $1.
If an implementation platform supports bargein, the application author can specify whether a user can interrupt, or "bargein" on, a prompt using speech or DTMF input. This speeds up conversations, but is not always desired. If the application author requires that the user hear all of a warning, legal notice, or advertisement, bargein should be disabled. This is done with the bargein attribute:
<prompt bargein="false"><audio src="legalese.wav"/></prompt>
Users can interrupt a prompt whose bargein attribute is true, but must wait for completion of a prompt whose bargein attribute is false. In the case where several prompts are queued, the bargein attribute of each prompt is honored during the period of time in which that prompt is playing. If bargein occurs during any prompt in a sequence, all subsequent prompts are not played (even those whose bargein attribute is set to false). If the bargein attribute is not specified, then the value of the bargein property is used if set.
When the bargein attribute is false, input is not buffered while the prompt is playing, and any DTMF input buffered in a transition state is deleted from the buffer (Section 4.1.8 describes input collection during transition states).
Note that not all speech recognition engines or implementation platforms support bargein. For a platform to support bargein, it must support at least one of the bargein types described in Section 4.1.5.1.
When bargein is enabled, the bargeintype attribute can be used to suggest the type of bargein the platform will perform in response to voice or DTMF input. Possible values for this attribute are:
speech | The prompt will be stopped as soon as speech or DTMF input is detected. The prompt is stopped irrespective of whether or not the input matches a grammar and irrespective of which grammars are active. |
---|---|
hotword | The prompt will not be stopped until a complete match of an active grammar is detected. Input that does not match a grammar is ignored (note that this even applies during the timeout period); as a consequence, a nomatch event will never be generated in the case of hotword bargein. |
If the bargeintype attribute is not specified, then the value of the bargeintype property is used. Implementations that claim to support bargein are required to support at least one of these two types. Mixing these types within a single queue of prompts can result in unpredictable behavior and is discouraged.
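For instance, a long announcement that should only be interrupted by a complete command match could be written as (the grammar URI and wording are illustrative):

```xml
<field name="command">
  <grammar src="http://server.example.com/commands.grxml"/>
  <prompt bargeintype="hotword">
    You can say balance, transactions, or transfer at any point
    during this announcement.
  </prompt>
</field>
```

Speech that does not completely match the command grammar is ignored, and the announcement continues playing.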
In the case of "speech" bargeintype, the exact meaning of "speech input" is necessarily implementation-dependent, due to the complexity of speech recognition technology. It is expected that the prompt will be stopped as soon as the platform is able to reliably determine that the input is speech. Stopping the prompt as early as possible is desirable because it avoids the "stutter" effect in which a user stops in mid-utterance and re-starts if he does not believe that the system has heard him.
Tapered prompts are those that may change with each attempt. Information-requesting prompts may become more terse under the assumption that the user is becoming more familiar with the task. Help messages become more detailed perhaps, under the assumption that the user needs more help. Or, prompts can change just to make the interaction more interesting.
Each input item, <initial>, and menu has an internal prompt counter that is reset to one each time the form or menu is entered. Whenever the system selects a given input item in the select phase of the FIA and the FIA does perform normal selection and queuing of prompts (i.e., as described in Section 5.3.6, the previous iteration of the FIA did not end with a catch handler that had no reprompt), the input item's associated prompt counter is incremented. This is the mechanism supporting tapered prompts.
For instance, here is a form with a form-level prompt and field-level prompts:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
<form>
  <block>
    <prompt bargein="false">
      Welcome to the ice cream survey.
    </prompt>
  </block>
  <field name="flavor">
    <grammar mode="voice" version="1.0" root="root">
      <rule id="root" scope="public">
        <one-of>
          <item>vanilla</item>
          <item>chocolate</item>
          <item>strawberry</item>
        </one-of>
      </rule>
    </grammar>
    <prompt count="1">What is your favorite flavor?</prompt>
    <prompt count="3">Say chocolate, vanilla, or strawberry.</prompt>
    <help>Sorry, no help is available.</help>
  </field>
</form>
</vxml>
A conversation using this form follows:
C: Welcome to the ice cream survey.
C: What is your favorite flavor? (the "flavor" field's prompt counter is 1)
H: Pecan praline.
C: I do not understand.
C: What is your favorite flavor? (the prompt counter is now 2)
H: Pecan praline.
C: I do not understand.
C: Say chocolate, vanilla, or strawberry. (the prompt counter is now 3)
H: What if I hate those?
C: I do not understand.
C: Say chocolate, vanilla, or strawberry. (the prompt counter is now 4)
H: ...
This is just an example to illustrate the use of prompt counters. A polished form would need to offer a more extensive range of choices and to deal with out of range values in a more flexible way.
When it is time to select a prompt, the prompt counter is examined. The child prompt with the highest count attribute less than or equal to the prompt counter is used. If a prompt has no count attribute, a count of "1" is assumed.
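Applied to the ice cream example above, the selection works as follows:

```xml
<!-- prompt counter 1 or 2: the count "1" prompt is played;
     prompt counter 3 or greater: the count "3" prompt is played -->
<prompt count="1">What is your favorite flavor?</prompt>
<prompt count="3">Say chocolate, vanilla, or strawberry.</prompt>
```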
A conditional prompt is one that is spoken only if its condition is satisfied. In this example, a prompt is varied on each visit to the enclosing form.
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
<form>
  <var name="r" expr="Math.random()"/>
  <field name="another">
    <grammar type="application/srgs+xml" src="/grammars/boolean.grxml"/>
    <prompt cond="r &lt; .50">
      Would you like to hear another elephant joke?
    </prompt>
    <prompt cond="r &gt;= .50">
      For another joke say yes. To exit say no.
    </prompt>
    <filled>
      <if cond="another">
        <goto next="#pick_joke"/>
      </if>
    </filled>
  </field>
</form>
</vxml>
When a prompt must be chosen, a set of prompts to be queued is chosen according to the following algorithm:
All elements that remain on the list will be queued for play.
The timeout attribute specifies the interval of silence allowed while waiting for user input after the end of the last prompt. If this interval is exceeded, the platform will throw a noinput event. This attribute defaults to the value specified by the timeout property (see Section 6.3.4) at the time the prompt is queued. In other words, each prompt has its own timeout value.
The reason for allowing timeouts to be specified as prompt attributes is to support tapered timeouts. For example, the user may be given five seconds for the first input attempt, and ten seconds on the next.
The prompt timeout attribute determines the noinput timeout for the following input:
<prompt count="1">
  Pick a color for your new Model T.
</prompt>
<prompt count="2" timeout="120s">
  Please choose color of your new nineteen twenty four
  Ford Model T. Possible colors are black, black, or black.
  Please take your time.
</prompt>
If several prompts are queued before a field input, the timeout of the last prompt is used.
A VoiceXML interpreter is at all times in one of two states:
The waiting and transitioning states are related to the phases of the Form Interpretation Algorithm as follows:
This distinction of states is made in order to greatly simplify the programming model. In particular, an important consequence of this model is that the VoiceXML application designer can rely on all executable content (such as the content of <filled> and <block> elements) being run to completion, because it is executed while in the transitioning state, which may not be interrupted by input.
While in the transitioning state various prompts are queued, either by the <prompt> element in executable content or by the <prompt> element in form items. In addition, audio may be queued by the fetchaudio attribute. The queued prompts and audio are played either
Note that when a prompt's bargein attribute is false, input is not collected and DTMF input buffered in a transition state is deleted (see Section 4.1.5).
When an ASR grammar is matched, if DTMF input was consumed by a simultaneously active DTMF grammar (but did not result in a complete match of the DTMF grammar), the DTMF input may, at processor discretion, be discarded.
Before the interpreter exits, all queued prompts are played to completion. The interpreter remains in the transitioning state and no input is accepted while the interpreter is exiting.
It is a permissible optimization to begin playing prompts queued during the transitioning state before reaching the waiting state, provided that correct semantics are maintained regarding processing of the input audio received while the prompts are playing, for example with respect to bargein and grammar processing.
The following examples illustrate the operation of these rulesin some common cases.
Typical non-fetching case: field, followed by executable content (such as <block> and <filled>), followed by another field.
in document d0
  <field name="f0"/>
  <block>
    executable content e1
    queues prompts {p1}
  </block>
  <field name="f2">
    queues prompts {p2}
    enables grammars {g2}
  </field>
As a result of input received while waiting in field f0, the following actions take place:
Typical fetching case: field, followed by executable content (such as <block> and <filled>) ending with a <goto> that specifies fetchaudio, ending up in a field in a different document that is fetched from a server.
in document d0
  <field name="f0"/>
  <block>
    executable content e1
    queues prompts {p1}
    ends with goto f2 in d1 with fetchaudio fa
  </block>
in document d1
  <field name="f2">
    queues prompts {p2}
    enables grammars {g2}
  </field>
As a result of input received while waiting in field f0, the following actions take place:
As in Case 2, but no fetchaudio is specified.
in document d0
  <field name="f0"/>
  <block>
    executable content e1
    queues prompts {p1}
    ends with goto f2 in d1 (no fetchaudio specified)
  </block>
in document d1
  <field name="f2">
    queues prompts {p2}
    enables grammars {g2}
  </field>
As a result of input received while waiting in field f0, the following actions take place:
VoiceXML variables are in all respects equivalent to ECMAScript variables: they are part of the same variable space. VoiceXML variables can be used in a <script> just as variables defined in a <script> can be used in VoiceXML. Declaring a variable using <var> is equivalent to using a 'var' statement in a <script> element. <script> can also appear everywhere that <var> can appear. VoiceXML variables are also declared by form items.
The variable naming convention is as in ECMAScript, but names beginning with the underscore character ("_") and names ending with a dollar sign ("$") are reserved for internal use. VoiceXML variables, including form item variables, must not contain ECMAScript reserved words. They must also follow ECMAScript rules for referential correctness. For example, variable names must be unique and their declaration must not include a dot - "var x.y" is an illegal declaration in ECMAScript. Variable names which violate naming conventions or ECMAScript rules cause an 'error.semantic' event to be thrown.
Variables are declared by <var> elements:
<var name="home_phone"/>
<var name="pi" expr="3.14159"/>
<var name="city" expr="'Sacramento'"/>
They are also declared by form items:
<field name="num_tickets">
  <grammar type="application/srgs+xml" src="/grammars/number.grxml"/>
  <prompt>How many tickets do you wish to purchase?</prompt>
</field>
Variables declared without an explicit initial value are initialized to the ECMAScript undefined value. Variables must be declared before being used either in VoiceXML or ECMAScript. Use of an undeclared variable results in an ECMAScript error which is thrown as an error.semantic. Variables declared using "var" in ECMAScript can be used in VoiceXML, just as declared VoiceXML variables can be used in ECMAScript.
In a form, the variables declared by <var> and those declared by form items are initialized when the form is entered. The initializations are guaranteed to take place in document order, so that this, for example, is legal:
<?xml version="1.0" encoding="UTF-8"?> <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/vxml http://www.w3.org/TR/voicexml20/vxml.xsd"><form> <var name="one" expr="1"/> <field name="two" expr="one+1"> <grammar type="application/srgs+xml" src="/grammars/number.grxml"/> </field> <var name="three" expr="two+1"/> <field name="go_on" type="boolean"> <prompt>Say yes or no to continue</prompt> </field> </form></vxml>
When the user visits this <form>, the form's initialization first declares the variable one and sets its value to 1. Then it declares the variable two and gives it the value 2. Then the initialization logic declares the variable three and gives it the value 3. The form interpretation algorithm then enters its main interpretation loop and begins at the go_on field.
VoiceXML uses an ECMAScript scope chain to allow variables to be declared at different levels of hierarchy in an application. For instance, a variable declared at document scope can be referenced anywhere within that document, whereas a local variable declared in a catch element is only available within that catch element. In order to preserve these scoping semantics, all ECMAScript variables must be declared. Use of an undeclared variable results in an ECMAScript error which is thrown as an error.semantic.
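As a non-normative sketch, a platform could model this scope chain with ECMAScript prototype inheritance, so that a variable reference falls through from the innermost scope outward (the scope objects and sample variables below are made up for illustration):

```javascript
// Each scope is an object whose prototype is its parent scope, so
// property lookup walks the chain exactly like VoiceXML name resolution.
const sessionScope     = Object.create(null);
const applicationScope = Object.create(sessionScope);
const documentScope    = Object.create(applicationScope);
const dialogScope      = Object.create(documentScope);
const anonymousScope   = Object.create(dialogScope);   // e.g. a <filled>

documentScope.city       = "Sacramento"; // a <var> at document level
dialogScope.num_tickets  = 3;            // a form item variable

// A reference in executable content resolves up the chain:
const cityRef    = anonymousScope.city;        // found in document scope
const ticketsRef = anonymousScope.num_tickets; // found in dialog scope
```

An undeclared name simply fails to resolve anywhere on the chain, which is what the platform surfaces as error.semantic.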
Variables can be declared in the following scopes:
session | These are read-only variables that pertain to an entire user session. They are declared and set by the interpreter context. New session variables cannot be declared by VoiceXML documents. See Section 5.1.4. |
---|---|
application | These are declared with <var> and <script> elements that are children of the application root document's <vxml> element. They are initialized when the application root document is loaded. They exist while the application root document is loaded, and are visible to the root document and any other loaded application leaf document. Note that while executing inside the application root document, document.x is equivalent to application.x. |
document | These variables are declared with <var> and <script> elements that are children of the document's <vxml> element. They are initialized when the document is loaded. They exist while the document is loaded. They are visible only within that document, unless the document is an application root, in which case the variables are visible by leaf documents through the application scope only. |
dialog | Each dialog (<form> or <menu>) has a dialog scope that exists while the user is visiting that dialog, and which is visible to the elements of that dialog. Dialog scope contains the following variables: variables declared by <var> and <script> child elements of <form>, form item variables, and form item shadow variables. The child <var> and <script> elements of <form> are initialized when the form is first visited, as opposed to <var> elements inside executable content, which are initialized when the executable content is executed. |
(anonymous) | Each <block>, <filled>, and <catch> element defines a new anonymous scope to contain variables declared in that element. |
The following diagram shows the scope hierarchy:
Figure 11: The scope hierarchy.
The curved arrows in this diagram show that each scope contains a pre-defined variable whose name is the same as the scope that refers to the scope itself. This allows you, for example, in the anonymous, dialog, and document scopes to refer to a variable X in the document scope using document.X. As another example, a <filled>'s variable scope is an anonymous scope local to the <filled>, whose parent variable scope is that of the <form>.
It is not recommended to use "session", "application", "document", and "dialog" as the names of variables and form items. While they are not reserved words, using them hides the pre-defined variables with the same name because of ECMAScript scoping rules used by VoiceXML.
Variables are referenced in cond and expr attributes:
<if cond="city == 'LA'"> <assign name="city" expr="'Los Angeles'"/> <elseif cond="city == 'Philly'"/> <assign name="city" expr="'Philadelphia'"/> <elseif cond="city =='Constantinople'"/> <assign name="city" expr="'Istanbul'"/> </if> <assign name="var1" expr="var1 + 1"/> <if cond="i > 1"> <assign name="i" expr="i-1"/> </if>
The expression language used in cond and expr is precisely ECMAScript. Note that the cond operators "<", "<=", and "&&" must be escaped in XML (to "&lt;", "&lt;=", and "&amp;&amp;").
Variable references match the closest enclosing scope according to the scope chain given above. You can prefix a reference with a scope name for clarity or to resolve ambiguity. For instance, to save the value of a variable associated with one of the fields in a form for use later on in a document:
<assign name="document.ssn" expr="dialog.ssn"/>
If the application root document has a variable x, it is referred to as application.x in non-root documents, and as either application.x or document.x in the application root document. If the document does not have a specified application root and has a variable x, it is referred to as either application.x or document.x in the document.
Interpretations are sorted by confidence score, from highest to lowest. Interpretations with the same confidence score are further sorted according to the precedence relationship (see Section 3.1.4) among the grammars producing the interpretations. Different elements in application.lastresult$ will always differ in their utterance, interpretation, or both.
The number of application.lastresult$ elements is guaranteed to be greater than or equal to one and less than or equal to the system property "maxnbest". If no results have been generated by the system, then "application.lastresult$" shall be ECMAScript undefined.
Additionally, application.lastresult$ itself contains the properties confidence, utterance, inputmode, and interpretation corresponding to those of the 0th element in the ECMAScript array.
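A non-normative ECMAScript sketch of this ordering, using made-up recognition results (grammar-precedence tie-breaking is omitted for brevity):

```javascript
// Illustrative only: order n-best results the way application.lastresult$
// is specified to be ordered -- by confidence, highest first.
const results = [
  { utterance: "Boston",  confidence: 0.55, inputmode: "voice" },
  { utterance: "Austin",  confidence: 0.90, inputmode: "voice" },
  { utterance: "Houston", confidence: 0.72, inputmode: "voice" }
];

const lastresult = results.slice().sort((a, b) => b.confidence - a.confidence);

// application.lastresult$ itself exposes the properties of the 0th element:
const top = lastresult[0];
```

After the sort, lastresult[0] is the "Austin" result, and its confidence, utterance, and inputmode are what application.lastresult$ itself would report.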
All of the shadow variables described above are set immediately after any recognition. In this context, a <nomatch> event counts as a recognition, and causes the value of "application.lastresult$" to be set, though the values stored in application.lastresult$ are platform dependent. In addition, the existing values of field variables are not affected by a <nomatch>. In contrast, a <noinput> event does not change the value of "application.lastresult$". After the value of "application.lastresult$" is set, the value persists (unless it is modified by the application) until the browser enters the next waiting state, when it is set to undefined. Similarly, when an application root document is loaded, this variable is set to the value undefined. The variable application.lastresult$ and all of its components are writeable and can be modified by the application.
The following example shows how application.lastresult$ can be used in a field level <catch> to access a <link> grammar recognition result and transition to different dialog states depending on confidence:
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/vxml http://www.w3.org/TR/voicexml20/vxml.xsd"><link event="menulinkevent"> <grammar src="/grammars/linkgrammar.grxml" type="application/srgs+xml"/></link><form> <field> <prompt> Say something </prompt> <catch event="menulinkevent"> <if cond="application.lastresult$.confidence < 0.7"> <goto nextitem="confirmlinkdialog"/> <else/> <goto next="./main_menu.html"/> </if> </catch> </field></form></vxml>
The final example demonstrates how a script can be used to iterate over the array of results in application.lastresult$, where each element is represented by "application.lastresult$[i]":
<?xml version="1.0" encoding="UTF-8"?> <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/vxml http://www.w3.org/TR/voicexml20/vxml.xsd"> <form> <field name="color"> <prompt> Say a color </prompt> <grammar type="application/srgs+xml" src="color.grxml" /> <filled> <var name="confident_count" expr="0"/> <script> <![CDATA[ // number of results var len = application.lastresult$.length; // iterate through array for (var i = 0; i < len; i++) { // count results with high confidence if (application.lastresult$[i].confidence > .7) { confident_count++; } } ]]> </script> <if cond="confident_count > 1"> <goto next="#verify"/> </if> </filled> </field></form></vxml>
The platform throws events when the user does not respond, doesn't respond in a way that the application understands, requests help, etc. The interpreter throws events if it finds a semantic error in a VoiceXML document, or when it encounters a <throw> element. Events are identified by character strings.
Each element in which an event can occur has a set of catch elements, which include:
<catch>
<error>
<help>
<noinput>
<nomatch>
An element inherits the catch elements ("as if by copy") from each of its ancestor elements, as needed. If a field, for example, does not contain a catch element for nomatch, but its form does, the form's nomatch catch element is used. In this way, common event handling behavior can be specified at any level, and it applies to all descendants.
The "as if by copy" semantics for inheriting catch elements implies that when a catch element is executed, variables are resolved and thrown events are handled relative to the scope where the original event originated, not relative to the scope that contains the catch element. For example, consider a catch element that is defined at document scope handling an event that originated in a <field> within the document. In such a catch element variable references are resolved relative to the <field>'s scope, and if an event is thrown by the catch element it is handled relative to the <field>. Similarly, relative URI references in a catch element are resolved against the active document and not relative to the document in which they were declared. Finally, properties are resolved relative to the element where the event originated. For example, a prompt element defined as part of a document level catch would use the innermost property value of the active form item to resolve its timeout attribute if no value is explicitly specified.
The <throw> element throws an event. These can be the pre-defined ones:
<throw event="nomatch"/> <throw event="connection.disconnect.hangup"/>
or application-defined events:
<throw event="com.att.portal.machine"/>
Attributes of <throw> are:
event | The event being thrown. |
---|---|
eventexpr | An ECMAScript expression evaluating to the name of the event being thrown. |
message | A message string providing additional context about the event being thrown. For the pre-defined events thrown by the platform, the value of the message is platform-dependent. The message is available as the value of a variable within the scope of the catch element; see below. |
messageexpr | An ECMAScript expression evaluating to the message string. |
Exactly one of "event" or "eventexpr" must be specified; otherwise, an error.badfetch event is thrown. Exactly one of "message" or "messageexpr" may be specified; otherwise, an error.badfetch event is thrown.
Unless explicitly stated otherwise, VoiceXML does not specify when events are thrown.
The catch element associates a catch with a document, dialog, or form item (except for blocks). It contains executable content.
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/vxml http://www.w3.org/TR/voicexml20/vxml.xsd"><form> <field name="user_id" type="digits"> <prompt>What is your username</prompt> </field> <field name="password"> <prompt>What is the code word?</prompt> <grammar version="1.0" root="root"> <rule scope="public">rutabaga</rule> </grammar> <help>It is the name of an obscure vegetable.</help> <catch event="nomatch noinput" count="3"> <prompt>Security violation!</prompt> <submit next="http://www.example.com/apprehend_felon.vxml" namelist="user_id"/> </catch> </field> </form></vxml>
The catch element's anonymous variable scope includes the special variable _event which contains the name of the event that was thrown. For example, the following catch element can handle two types of events:
<catch event="event.foo event.bar"> <if cond="_event=='event.foo'"> <!-- Play this for event.foo events --> <audio src="foo.wav"/> <else/> <!-- Play this for event.bar events --> <audio src="bar.wav"/> </if> <!-- Continue with common handling for either event --></catch>
The _event variable is inspected to select the audio to play based on the event that was thrown. The foo.wav file will be played for event.foo events. The bar.wav file will be played for event.bar events. The remainder of the catch element contains executable content that is common to the handling of both event types.
The catch element's anonymous variable scope also includes the special variable _message which contains the value of the message string from the corresponding <throw> element, or a platform-dependent value for the pre-defined events raised by the platform. If the thrown event does not specify a message, the value of _message is ECMAScript undefined.
If a <catch> element contains a <throw> element with the same event, then there may be an infinite loop:
<catch event="help"> <throw event="help"/> </catch>
A platform could detect this situation and throw a semantic error instead.
Attributes of <catch> are:
event | The event or events to catch. A space-separated list of events may be specified, indicating that this <catch> element catches all the events named in the list. In such a case a separate event counter (see "count" attribute) is maintained for each event. If the attribute is unspecified, all events are to be caught. |
---|---|
count | The occurrence of the event (default is 1). The count allows you to handle different occurrences of the same event differently. Each <form>, <menu>, and form item maintains a counter for each event that occurs while it is being visited. Item-level event counters are used for events thrown while visiting individual form items and while executing <filled> elements contained within those items. Form-level and menu-level counters are used for events thrown during dialog initialization and while executing form-level <filled> elements. Form-level and menu-level event counters are reset each time the <menu> or <form> is re-entered. Form-level and menu-level event counters are not reset by the <clear> element. Item-level event counters are reset each time the <form> containing the item is re-entered. Item-level event counters are also reset when the item is reset with the <clear> element. An item's event counters are not reset when the item is re-entered without leaving the <form>. Counters are incremented against the full event name and every prefix matching event name; for example, occurrence of the event "event.foo.1" increments the counters for "event.foo.1" plus "event.foo" and "event". |
cond | An expression which must evaluate to true after conversion to boolean in order for the event to be caught. Defaults to true. |
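The prefix-counting rule in the "count" row above can be sketched in ECMAScript (an illustrative helper, not a platform API):

```javascript
// Illustrative sketch: an occurrence of "event.foo.1" increments the
// counters for "event.foo.1", "event.foo", and "event".
function incrementCounters(counters, eventName) {
  const tokens = eventName.split(".");
  // Count the full name and every dot-separated prefix of it.
  for (let i = 1; i <= tokens.length; i++) {
    const name = tokens.slice(0, i).join(".");
    counters[name] = (counters[name] || 0) + 1;
  }
  return counters;
}

const counters = incrementCounters({}, "event.foo.1");
```

A <catch event="event.foo" count="2"> would then fire on the second occurrence of any event whose counter rolls up to "event.foo".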
The <error>, <help>, <noinput>, and <nomatch> elements are shorthands for very common types of <catch> elements.
The <error> element is short for <catch event="error"> and catches all events of type error:
<error> An error has occurred -- please call again later. <exit/></error>
The <help> element is an abbreviation for <catch event="help">:
<help>No help is available.</help>
The <noinput> element abbreviates <catch event="noinput">:
<noinput>I didn't hear anything, please try again.</noinput>
And the <nomatch> element is short for <catch event="nomatch">:
<nomatch>I heard something, but it wasn't a known city.</nomatch>
These elements take the attributes:
count | The event count (as in <catch>). |
---|---|
cond | An optional condition to test to see if the event is caught by this element (as in <catch> described in Section 5.2.2). Defaults to true. |
An element inherits the catch elements ("as if by copy") from each of its ancestor elements, as needed. For example, if a <field> element inherits a <catch> element from the document
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/vxml http://www.w3.org/TR/voicexml20/vxml.xsd"><catch event="event.foo"> <audio src="beep.wav"/></catch><form> <field name="color"> <prompt>Please say a primary color</prompt> <grammar type="application/srgs">red | yellow | blue</grammar> <nomatch> <throw event="event.foo"/> </nomatch> </field></form></vxml>
then the <catch> element is implicitly copied into <field> as if defined below:
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/vxml http://www.w3.org/TR/voicexml20/vxml.xsd"><form> <field> <prompt>Please say a primary color</prompt> <grammar type="application/srgs">red | yellow | blue</grammar> <nomatch> <throw event="event.foo"/> </nomatch> <catch event="event.foo"> <audio src="beep.wav"/> </catch> </field></form></vxml>
When an event is thrown, the scope in which the event is handled and its enclosing scopes are examined to find the best qualified catch element, according to the following algorithm:
The name of a thrown event matches the catch element event name if it is an exact match, a prefix match, or if the catch event attribute is not specified (note that the event attribute cannot be specified as an empty string - event="" is syntactically invalid). A prefix match occurs when the catch element event attribute is a token prefix of the name of the event being thrown, where the dot is the token separator, all trailing dots are removed, and a remaining empty string matches everything. For example,
<catch event="connection.disconnect"> <prompt>Caught a connection dot disconnect event</prompt></catch>
will prefix match the event connection.disconnect.transfer.
<catch event="com.example.myevent"> <prompt>Caught a com dot example dot my event</prompt></catch>
prefix matches com.example.myevent.event1., com.example.myevent. and com.example.myevent..event1, but not com.example.myevents.event1. Finally,
<catch event="."> <prompt>Caught an event</prompt></catch>
prefix matches all events (as does <catch> without an event attribute).
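The matching rule above can be sketched as an ECMAScript function (illustrative only; `catchMatches` is a hypothetical name, and the syntactically invalid event="" case is assumed to be rejected before this test runs):

```javascript
// Name test for catch selection: exact match, or token-prefix match
// after stripping trailing dots; an unspecified attribute (or one that
// reduces to the empty string, e.g. ".") matches every event.
function catchMatches(catchAttr, thrownEvent) {
  if (catchAttr === undefined) return true;      // no event attribute
  const pattern = catchAttr.replace(/\.+$/, ""); // remove trailing dots
  if (pattern === "") return true;               // "." matches everything
  const pTokens = pattern.split(".");
  const eTokens = thrownEvent.split(".");
  if (pTokens.length > eTokens.length) return false;
  // Every pattern token must equal the corresponding event token.
  return pTokens.every((tok, i) => tok === eTokens[i]);
}
```

Note that the match is token-wise, so "com.example.myevent" does not match "com.example.myevents.event1" even though it is a character-wise prefix.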
Note that the catch element selection algorithm gives priority to catch elements that occur earlier in a document over those that occur later, but does not give priority to catch elements that are more specific over those that are less specific. Therefore it is generally advisable to specify catch elements in order from more specific to less specific. For example, it would be advisable to specify catch elements for "error.foo" and "error" in that order, as follows:
<catch event="error.foo"> <prompt>Caught an error dot foo event</prompt></catch> <catch event="error"> <prompt>Caught an error event</prompt> </catch>
If the catch elements were specified in the opposite order, the catch element for "error.foo" would never be executed.
The interpreter is expected to provide implicit default catch handlers for the noinput, help, nomatch, cancel, exit, and error events if the author did not specify them.
The system default behavior of catch handlers for various events and errors is summarized by the definitions below that specify (1) whether any audio response is to be provided, and (2) how execution is affected. Note: where an audio response is provided, the actual content is platform dependent.
Event Type | Audio Provided | Action |
---|---|---|
cancel | no | don't reprompt |
error | yes | exit interpreter |
exit | no | exit interpreter |
help | yes | reprompt |
noinput | no | reprompt |
nomatch | yes | reprompt |
maxspeechtimeout | yes | reprompt |
connection.disconnect | no | exit interpreter |
all others | yes | exit interpreter |
Specific platforms will differ in the default prompts presented.
There are pre-defined events, and application- and platform-specific events. Events are also subdivided into plain events (things that happen normally) and error events (abnormal occurrences). The error naming convention allows for multiple levels of granularity.
A conforming browser may throw an event that extends a pre-defined event string so long as the event contains the specified pre-defined event string as a dot-separated exact initial substring of its event name. Applications that write catch handlers for the pre-defined events will be interoperable. Applications that write catch handlers for extended event names are not guaranteed interoperability. For example, if in loading a grammar file a syntax error is detected, the platform must throw "error.badfetch". Throwing "error.badfetch.grammar.syntax" is an acceptable implementation.
Components of event names in italics are to be substituted with the relevant information; for example, in error.unsupported.element, element is substituted with the name of the VoiceXML element which is not supported, such as error.unsupported.transfer. All other event name components are fixed.
Further information about an event may be specified in the "_message" variable (see Section 5.2.2).
The pre-defined events are:
In addition to transfer errors (Section 2.3.7.3), the pre-defined errors are:
Errors encountered during document loading, including transport errors (no document found, HTTP status code 404, and so on) and syntactic errors (no <vxml> element, etc.) result in a badfetch error event raised in the calling document. Errors that occur after loading and before entering the initialization phase of the Form Interpretation Algorithm are handled in a platform-specific manner. Errors that occur after entering the FIA initialization phase, such as semantic errors, are raised in the new document. The handling of errors encountered during the loading of the first document in a session is platform-specific.
Application-specific and platform-specific event types should use the reversed Internet domain name convention to avoid naming conflicts. For example:
Catches can catch specific events (cancel) or all those sharing a prefix (error.unsupported).
Executable content refers to a block of procedural logic. Such logic appears in:
The <block> form item.
The <filled> actions in forms and input items.
Event handlers (<catch>, <help>, et cetera).
Executable elements are executed in document order in their block of procedural logic. If an executable element generates an error, that error is thrown immediately. Subsequent executable elements in that block of procedural logic are not executed.
This section covers the elements that can occur in executable content.
This element declares a variable. It can occur in executable content or as a child of <form> or <vxml>. Examples:
<var name="phone" expr="'6305551212'"/> <var name="y" expr="document.z+1"/>
If it occurs in executable content, it declares a variable in the anonymous scope associated with the enclosing <block>, <filled>, or catch element. This declaration is made only when the <var> element is executed. If the variable is already declared in this scope, subsequent declarations act as assignments, as in ECMAScript.
If a <var> is a child of a <form> element, it declares a variable in the dialog scope of the <form>. This declaration is made during the form's initialization phase as described in Section 2.1.6.1. The <var> element is not a form item, and so is not visited by the Form Interpretation Algorithm's main loop.
If a <var> is a child of a <vxml> element, it declares a variable in the document scope; and if it is the child of a <vxml> element in a root document then it also declares the variable in the application scope. This declaration is made when the document is initialized; initializations happen in document order.
Attributes of <var> include:
name | The name of the variable that will hold the result. Unlike the name attribute of the <assign> element (Section 5.3.2), this attribute must not specify a variable with a scope prefix (if a variable is specified with a scope prefix, then an error.semantic event is thrown). The scope in which the variable is defined is determined from the position in the document at which the element is declared. |
---|---|
expr | The initial value of the variable (optional). If there is no expr attribute, the variable retains its current value, if any. Variables start out with the ECMAScript value undefined if they are not given initial values. |
The <assign> element assigns a value to a variable:
<assign name="flavor" expr="'chocolate'"/> <assign name="document.mycost" expr="document.mycost+14"/>
It is illegal to make an assignment to a variable that has not been explicitly declared using a <var> element or a var statement within a <script>. Attempting to assign to an undeclared variable causes an error.semantic event to be thrown.
Note that when an ECMAScript object, e.g. "obj", has been properly initialized then its properties, for instance "obj.prop1", can be assigned without explicit declaration (in fact, an attempt to declare ECMAScript object properties such as "obj.prop1" would result in an error.semantic event being thrown).
Attributes include:
name | The name of the variable being assigned to. As specified in Section 5.1.2, the corresponding variable must have been previously declared; otherwise an error.semantic event is thrown. By default, the scope in which the variable is resolved is the closest enclosing scope of the currently active element. To remove ambiguity, the variable name may be prefixed with a scope name as described in Section 5.1.3. |
---|---|
expr | The new value of the variable. |
The <clear> element resets one or more variables, including form items. For each specified variable name, the variable is resolved relative to the current scope according to Section 5.1.3 (to remove ambiguity, each variable name in the namelist may be prefixed with a scope name). Once a declared variable has been identified, its value is assigned the ECMAScript undefined value. In addition, if the variable name corresponds to a form item, then the form item's prompt counter and event counters are reset.
For example:
<clear namelist="city state zip"/>
The attribute is:
namelist | The list of variables to be reset; this can include variable names other than form items. If an undeclared variable is referenced in the namelist, then an error.semantic is thrown (Section 5.1.1). When not specified, all form items in the current form are cleared. |
---|
The <if> element is used for conditional logic. It has optional <else> and <elseif> elements.
<if cond="total > 1000"> <prompt>This is way too much to spend.</prompt> <throw event="com.xyzcorp.acct.toomuchspent"/> </if> <if cond="amount < 29.95"> <assign name="x" expr="amount"/> <else/> <assign name="x" expr="29.95"/> </if> <if cond="flavor == 'vanilla'"> <assign name="flavor_code" expr="'v'"/> <elseif cond="flavor == 'chocolate'"/> <assign name="flavor_code" expr="'h'"/> <elseif cond="flavor == 'strawberry'"/> <assign name="flavor_code" expr="'b'"/> <else/> <assign name="flavor_code" expr="'?'"/> </if>
Prompts can appear in executable content, in their full generality, except that the <prompt> count attribute is meaningless. In particular, the cond attribute can be used in executable content. Prompts may be wrapped with <prompt> and </prompt>, or represented using PCDATA. Wherever <prompt> is allowed, the PCDATA xyz is interpreted exactly as if it had appeared as <prompt>xyz</prompt>.
<nomatch count="1"> To open the pod bay door, say your code phrase clearly. </nomatch> <nomatch count="2"> <prompt> This is your <emphasis>last</emphasis> chance. </prompt> </nomatch> <nomatch count="3"> Entrance denied. <exit/> </nomatch>
The FIA expects a catch element to queue appropriate prompts in the course of handling an event. Therefore, the FIA does not generally perform the normal selection and queuing of prompts on the next iteration following the execution of a catch element. However, the FIA does perform normal selection and queuing of prompts after the execution of a catch element (<catch>, <error>, <help>, <noinput>, <nomatch>) in two cases:
The catch element executes a <reprompt> element.
The catch element ends by leaving the dialog via <goto>, <submit>, or <return>.
In these two cases, after the FIA selects the next form item to visit, it performs normal prompt processing, including selecting and queuing the form item's prompts and incrementing the form item's prompt counter.
For example, this noinput catch expects the next form item prompt to be selected and played:
<field name="want_ice_cream"> <grammar type="application/srgs+xml" src="/grammars/boolean.grxml"/> <prompt>Do you want ice cream for dessert?</prompt> <prompt count="2"> If you want ice cream, say yes. If you do not want ice cream, say no. </prompt> <noinput> I could not hear you. <!-- Cause the next prompt to be selected and played. --> <reprompt/> </noinput> </field>
A quiet user would hear:
C: Do you want ice cream for dessert?
H:(silence)
C: I could not hear you.
C: If you want ice cream, say yes. If you don't want ice cream, say no.
H:(silence)
C: I could not hear you.
C: If you want ice cream, say yes. If you don't want ice cream, say no.
H: No
If there were no <reprompt>, the user would instead hear:
C: Do you want ice cream for dessert?
H:(silence)
C: I could not hear you.
H:(silence)
C: I could not hear you.
H: No
Note that a consequence of skipping the prompt selection phase as described above is that the prompt counter of the form item selected by the FIA after the execution of a catch element (that does not execute a <reprompt>, or leave the dialog via <goto>, <submit> or <return>) will not be incremented.
Also note that the prompt selection phase following the execution of a catch element (that does not execute a <reprompt> or leave the dialog via <goto>, <submit> or <return>) is skipped even if the form item selected by the FIA is different from the previous form item.
The <reprompt> element has no effect outside of a catch.
The <goto> element is used to:
transition to another form item in the current form,
transition to another dialog in the current document, or
transition to another document.
To transition to another form item, use the nextitem attribute, or the expritem attribute if the form item name is computed using an ECMAScript expression:
<goto nextitem="ssn_confirm"/> <goto expritem="(type==12)? 'ssn_confirm' : 'reject'"/>
To go to another dialog in the same document, use next (or expr) with only a URI fragment:
<goto next="#another_dialog"/> <goto expr="'#' + 'another_dialog'"/>
To transition to another document, use next (or expr) with a URI:
<goto next="http://flight.example.com/reserve_seat"/> <goto next="./special_lunch#wants_vegan"/>
The URI may be absolute or relative to the current document. You may specify the starting dialog in the next document using a fragment that corresponds to the value of the id attribute of a dialog. If no fragment is specified, the first dialog in that document is chosen.
Note that transitioning to another dialog in the current document causes the old dialog's variables to be lost, even in the case where a dialog is transitioning to itself. Transitioning to another document using an absolute or relative URI will likewise drop the old document-level variables, even if the new document is the same one that is making the transition. However, document variables are retained when transitioning to an empty URI reference with a fragment identifier. For example, the following statements behave differently in a document with the URI http://someco.example.com/index.vxml:
<goto next="#foo"/><goto next="http://someco.example.com/index.vxml#foo"/>
According to [RFC2396], the fragment identifier (the part after the '#') is not part of a URI, and transitioning to empty URI references plus fragment identifiers should never result in a new document fetch. Therefore "#foo" in the first statement is an empty URI reference with a fragment identifier and document variables are retained. In the second statement "#foo" is part of an absolute URI and the document variables are lost. If you want data to persist across multiple documents, store data in the application scope.
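The distinction can be sketched with a hypothetical helper (illustrative only, not a platform API):

```javascript
// A <goto> target retains document-level variables only when it is an
// empty URI reference with just a fragment: nothing before the "#".
// Any absolute or relative URI before the "#" forces a document fetch.
function retainsDocumentVariables(uriReference) {
  return uriReference.startsWith("#");
}

retainsDocumentVariables("#foo");                                     // true
retainsDocumentVariables("http://someco.example.com/index.vxml#foo"); // false
```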
The dialog to transition to is specified by the URI referencein the <goto>'s next or expr attribute (see[RFC2396]). If this URIreference contains an absolute or relative URI, which mayinclude a query string, then that URI is fetched and the dialogis found in the resulting document.
If the URI reference contains only a fragment (i.e., noabsolute or relative URI), then there is no fetch: the dialog isfound in the current document.
The URI reference's fragment, if any, names the dialog totransition to. When there is no fragment, the dialog chosen isthe lexically first dialog in the document.
If the form item, dialog or document to transition to is notvalid (i.e. the form item, dialog or document does not exist),an error.badfetch must be thrown. Note that for errors which occurduring a dialog or document transition, the scope in which errorsare handled is platform specific. For errors which occurduring form item transition, the event is handled in the dialogscope.
Attributes of <goto> are:
next | The URI to which to transition. |
---|---|
expr | An ECMAScript expression that yields the URI. |
nextitem | The name of the next form item to visit in the current form. |
expritem | An ECMAScript expression that yields the name of the next form item to visit. |
fetchaudio | See Section 6.1. This defaults to the fetchaudio property. |
fetchhint | See Section 6.1. This defaults to the documentfetchhint property. |
fetchtimeout | See Section 6.1. This defaults to the fetchtimeout property. |
maxage | See Section 6.1. This defaults to the documentmaxage property. |
maxstale | See Section 6.1. This defaults to the documentmaxstale property. |
Exactly one of "next", "expr", "nextitem" or "expritem" must be specified; otherwise, an error.badfetch event is thrown.
The <submit> element is used to submit information to the origin Web server and then transition to the document sent back in the response. Unlike <goto>, it lets you submit a list of variables to the document server via an HTTP GET or POST request. For example, to submit a set of form items to the server you might have:
<submit next="log_request" method="post" namelist="name rank serial_number" fetchtimeout="100s" fetchaudio="audio/brahms2.wav"/>
The dialog to transition to is specified by the URI reference in the <submit>'s next or expr attribute (see [RFC2396], Section 4.2). The URI is always fetched even if it contains just a fragment. In the case of a fragment, the URI requested is the base URI of the current document. This means that the following two elements have substantially different effects:
<goto next="#get_pin"/>
<submit next="#get_pin"/>
Note that although the URI is always fetched and the resulting document is transitioned to, some <submit> requests can be satisfied by intermediate caches. This might happen, for example, if the origin Web server provides an explicit expiration time with the response.
If the dialog or document to transition to is not valid (i.e. the dialog or document does not exist), an error.badfetch must be thrown. Note that for errors which occur during a dialog or document transition, the scope in which errors are handled is platform specific.
Attributes of <submit> include:
next | The URI reference. |
---|---|
expr | Like next, except that the URI reference is dynamically determined by evaluating the given ECMAScript expression. |
namelist | The list of variables to submit. By default, all the named input item variables are submitted. If a namelist is supplied, it may contain individual variable references which are submitted with the same qualification used in the namelist. Declared VoiceXML and ECMAScript variables can be referenced. If an undeclared variable is referenced in the namelist, then an error.semantic is thrown (Section 5.1.1). |
method | The request method: get (the default) or post. |
enctype | The media encoding type of the submitted document (when the value of method is "post"). The default is application/x-www-form-urlencoded. Interpreters must also support multipart/form-data and may support additional encoding types. |
fetchaudio | See Section 6.1. This defaults to the fetchaudio property. |
fetchhint | See Section 6.1. This defaults to the documentfetchhint property. |
fetchtimeout | See Section 6.1. This defaults to the fetchtimeout property. |
maxage | See Section 6.1. This defaults to the documentmaxage property. |
maxstale | See Section 6.1. This defaults to the documentmaxstale property. |
Exactly one of "next" or "expr" must be specified; otherwise, an error.badfetch event is thrown.
When an ECMAScript variable is submitted to the server, its value is first converted into a string before being submitted. If the variable is an ECMAScript Object, the mechanism by which it is submitted is not currently defined; the mechanism of ECMAScript Object submission is reserved for future definition. Instead of submitting ECMAScript Objects directly, the application developer may explicitly submit properties of the Object, as in "date.month date.year".
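For instance, rather than submitting a hypothetical date object directly, its properties could be listed individually in the namelist (the URI and variable names below are illustrative only):

```xml
<!-- Illustrative sketch: "date" is assumed to be an ECMAScript object
     declared elsewhere; its properties are submitted individually. -->
<submit next="http://www.example.com/process"
        method="get"
        namelist="date.month date.year"/>
```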
If a <submit> contains a variable which references recorded audio but does not contain an ENCTYPE of multipart/form-data, the behavior is not specified. It is probably inappropriate to attempt to URL-encode large quantities of data.
Returns control to the interpreter context, which determines what to do next.
<exit/>
This element differs from <return> in that it terminates all loaded documents, while <return> returns from a <subdialog> invocation. If the <subdialog> caused a new document (or application) to be invoked, then <return> will cause that document to be terminated, but execution will resume after the <subdialog>.
Note that once <exit> returns control to the interpreter context, the interpreter context is free to do as it wishes. It may play a top level menu for the user, drop the call, or transfer the user to an operator, for example.
Attributes include:
expr | An ECMAScript expression that is evaluated as the return value (e.g. "0", "'oops!'", or "field1"). |
---|---|
namelist | Variable names to be returned to the interpreter context. The default is to return no variables; this means the interpreter context will receive an empty ECMAScript object. If an undeclared variable is referenced in the namelist, then an error.semantic is thrown (Section 5.1.1). |
Exactly one of "expr" or "namelist" may be specified; otherwise, an error.badfetch event is thrown.
The <exit> element does not throw an "exit" event.
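For example, a sketch of returning data to the interpreter context on exit (the variable names are hypothetical and would have to be declared before the <exit> executes):

```xml
<!-- Illustrative only: returns two hypothetical variables
     to the interpreter context -->
<exit namelist="account_number call_duration"/>
```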
Return ends execution of a subdialog and returns control and data to a calling dialog.
The attributes are:
event | Return, then throw this event. |
---|---|
eventexpr | Return, then throw the event to which this ECMAScript expression evaluates. |
message | A message string providing additional context about the event being thrown. The message is available as the value of a variable within the scope of the catch element; see Section 5.2.2. |
messageexpr | An ECMAScript expression evaluating to the message string. |
namelist | Variable names to be returned to the calling dialog. The default is to return no variables; this means the caller will receive an empty ECMAScript object. If an undeclared variable is referenced in the namelist, then an error.semantic is thrown (Section 5.1.1). |
Exactly one of "event", "eventexpr" or "namelist" may be specified; otherwise, an error.badfetch event is thrown. Exactly one of "message" or "messageexpr" may be specified; otherwise, an error.badfetch event is thrown.
In returning from a subdialog, an event can be thrown at the invocation point, or data is returned as an ECMAScript object with properties corresponding to the variables specified in its namelist. A <return> element that is encountered when not executing as a subdialog throws a semantic error.
The example below shows an event propagated from a subdialog to its calling dialog when the subdialog fails to obtain a recognizable result. It also shows data returned under normal conditions.
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form>
  <subdialog name="result" src="#getssn">
   <nomatch>
    <!-- a no match event that is returned by the subdialog
         indicates that a valid Social Security number could
         not be matched. -->
    <goto next="http://myservice.example.com/ssn-problems.vxml"/>
   </nomatch>
   <filled>
    <submit namelist="result.ssn"
      next="http://myservice.example.com/cgi-bin/process"/>
   </filled>
  </subdialog>
 </form>
 <form id="getssn">
  <field name="ssn">
   <grammar src="http://grammarlib/ssn.grxml"
     type="application/srgs+xml"/>
   <prompt>Please say Social Security number.</prompt>
   <nomatch count="3">
    <return event="nomatch"/>
   </nomatch>
   <filled>
    <return namelist="ssn"/>
   </filled>
  </field>
 </form>
</vxml>
The subdialog event handler for <nomatch> is triggered on the third failure to match; when triggered, it returns from the subdialog, and includes the nomatch event to be thrown in the context of the calling dialog. In this case, the calling dialog will execute its <nomatch> handler, rather than the <filled> element, where the resulting action is to execute a <goto> element. Under normal conditions, the <filled> element of the subdialog is executed after a recognized Social Security number is obtained, and then this value is returned to the calling dialog, and is accessible as result.ssn.
Causes the interpreter context to disconnect from the user. As a result, the interpreter context will throw a connection.disconnect.hangup event and enter the final processing state (as described in Section 1.5.4). Processing the <disconnect> element will also flush the prompt queue (as described in Section 4.1.8).
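As an illustrative sketch, a document might disconnect after a farewell prompt and catch the resulting event to perform final processing (the logging URI and the outcome variable are hypothetical):

```xml
<form>
  <block>
    <prompt>Thank you for calling. Goodbye.</prompt>
    <disconnect/>
  </block>
  <catch event="connection.disconnect.hangup">
    <!-- final processing state: e.g. log the call outcome before exiting -->
    <submit next="http://www.example.com/log" namelist="outcome"/>
  </catch>
</form>
```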
The <script> element allows the specification of a block of client-side scripting language code, and is analogous to the [HTML] <SCRIPT> element. For example, this document has a script that computes a factorial.
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <script>
  <![CDATA[
   function factorial(n) {
     return (n <= 1)? 1 : n * factorial(n-1);
   }
  ]]>
 </script>
 <form>
  <field name="fact">
   <grammar type="application/srgs+xml" src="/grammars/number.grxml"/>
   <prompt>
    Tell me a number and I'll tell you its factorial.
   </prompt>
   <filled>
    <prompt>
     <value expr="fact"/> factorial is
     <value expr="factorial(fact)"/>
    </prompt>
   </filled>
  </field>
 </form>
</vxml>
A <script> element may occur in the <vxml> and <form> elements, or in executable content (in <filled>, <if>, <block>, <catch>, or the short forms of <catch>). Scripts in the <vxml> element are evaluated just after the document is loaded, along with the <var> elements, in document order. Scripts in the <form> element are evaluated in document order, along with <var> elements and form item variables, each time execution moves into the <form> element. A <script> element in executable content is executed, like other executable elements, as it is encountered.
The <script> element has the following attributes:
src | The URI specifying the location of the script, if it is external. |
---|---|
charset | The character encoding of the script designated by src. UTF-8 and UTF-16 encodings of ISO/IEC 10646 must be supported (as in [XML]) and other encodings, as defined in the [IANA], may be supported. The default value is UTF-8. |
fetchhint | See Section 6.1. This defaults to the scriptfetchhint property. |
fetchtimeout | See Section 6.1. This defaults to the fetchtimeout property. |
maxage | See Section 6.1. This defaults to the scriptmaxage property. |
maxstale | See Section 6.1. This defaults to the scriptmaxstale property. |
Either an "src" attribute or an inline script (but not both) must be specified; otherwise, an error.badfetch event is thrown.
The VoiceXML <script> element (unlike the [HTML] <SCRIPT> element) does not have a type attribute; ECMAScript is the required scripting language for VoiceXML.
Each <script> element is executed in the scope of its containing element; i.e., it does not have its own scope. This means, for example, that variables declared with var in the <script> element are declared in the scope of the containing element of the <script> element. (In ECMAScript terminology, the "variable object" becomes the current scope of the containing element of the <script> element.)
Here is a time-telling service with a block containing a script that initializes time variables in the dialog scope of a form:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form>
  <var name="hours"/>
  <var name="minutes"/>
  <var name="seconds"/>
  <block>
   <script>
    var d = new Date();
    hours = d.getHours();
    minutes = d.getMinutes();
    seconds = d.getSeconds();
   </script>
  </block>
  <field name="hear_another">
   <grammar type="application/srgs+xml" src="/grammars/boolean.grxml"/>
   <prompt>
    The time is <value expr="hours"/> hours,
    <value expr="minutes"/> minutes, and
    <value expr="seconds"/> seconds.
   </prompt>
   <prompt>Do you want to hear another time?</prompt>
   <filled>
    <if cond="hear_another">
     <clear/>
    </if>
   </filled>
  </field>
 </form>
</vxml>
The content of a <script> element is evaluated in the same scope as a <var> element (see Section 5.1.2 Variable Scopes and Section 5.3.1 VAR).
The ECMAScript scope chain (see section 10.1.4 in [ECMASCRIPT]) is set up so that variables declared either with <var> or inside <script> are put into the scope associated with the element in which the <var> or <script> element occurs. For example, the variable declared in a <script> element under a <form> element has a dialog scope, and can be accessed as a dialog scope variable as follows:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form>
  <script>
   var now = new Date(); // this has a dialog scope
  </script>
  <var name="seconds" expr="now.getSeconds()"/> <!-- this has a dialog scope -->
  <block>
   <var name="now" expr="new Date()"/> <!-- this has an anonymous scope -->
   <script>
    var current = now.getSeconds();       // "now" in the anonymous scope
    var approx = dialog.now.getSeconds(); // "now" in the dialog scope
   </script>
  </block>
 </form>
</vxml>
All variables must be declared before being referenced by ECMAScript scripts, or by VoiceXML elements as described in Section 5.1.1.
The <log> element allows an application to generate a logging or debug message which a developer can use to help in application development or post-execution analysis of application performance.
The <log> element may contain any combination of text (CDATA) and <value> elements. The generated message consists of the concatenation of the text and the string form of the value of the "expr" attribute of the <value> elements.
The manner in which the message is displayed or logged is platform-dependent. The usage of label is platform-dependent. Platforms are not required to preserve white space.
ECMAScript expressions in <log> must be evaluated in document order. The use of the <log> element should have no other side-effects on interpretation.
<log>The card number was <value expr="card_num"/></log>
The <log> element has the following attributes:
label | An optional string which may be used, for example, to indicate the purpose of the log. |
---|---|
expr | An optional ECMAScript expression evaluating to a string. |
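For example, a label can tag related log messages for later filtering (the label value and the variables below are hypothetical):

```xml
<!-- Illustrative only: "billing" label and variables are hypothetical -->
<log label="billing">User <value expr="user_id"/>
  selected plan <value expr="plan"/></log>
```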
A VoiceXML interpreter context needs to fetch VoiceXML documents, and other resources, such as audio files, grammars, scripts, and objects. Each fetch of the content associated with a URI is governed by the following attributes:
fetchtimeout | The interval to wait for the content to be returned before throwing an error.badfetch event. The value is a Time Designation (see Section 6.5). If not specified, a value derived from the innermost fetchtimeout property is used. |
---|---|
fetchhint | Defines when the interpreter context should retrieve content from the server. prefetch indicates a file may be downloaded when the page is loaded, whereas safe indicates a file that should only be downloaded when actually needed. If not specified, a value derived from the innermost relevant fetchhint property is used. |
maxage | Indicates that the document is willing to use content whose age is no greater than the specified time in seconds (cf. 'max-age' in HTTP 1.1 [RFC2616]). The document is not willing to use stale content, unless maxstale is also provided. If not specified, a value derived from the innermost relevant maxage property, if present, is used. |
maxstale | Indicates that the document is willing to use content that has exceeded its expiration time (cf. 'max-stale' in HTTP 1.1 [RFC2616]). If maxstale is assigned a value, then the document is willing to accept content that has exceeded its expiration time by no more than the specified number of seconds. If not specified, a value derived from the innermost relevant maxstale property, if present, is used. |
When content is fetched from a URI, the fetchtimeout attribute determines how long to wait for the content (starting from the time when the resource is needed), and the fetchhint attribute determines when the content is fetched. The caching policy for a VoiceXML interpreter context utilizes the maxage and maxstale attributes and is explained in more detail below.
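For example, these attributes might all be combined on a single <goto> (the URI and the specific values shown are illustrative, not recommendations):

```xml
<!-- Illustrative values: wait up to 10 seconds for the fetch, defer it
     until needed, and accept a cached copy up to 60 seconds old, or one
     that expired no more than 30 seconds ago -->
<goto next="http://www.example.com/main_menu.vxml"
      fetchtimeout="10s"
      fetchhint="safe"
      maxage="60"
      maxstale="30"/>
```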
The fetchhint attribute, in combination with the various fetchhint properties, is merely a hint to the interpreter context about when it may schedule the fetch of a resource. Telling the interpreter context that it may prefetch a resource does not require that the resource be prefetched; it only suggests that the resource may be prefetched. However, the interpreter context is always required to honor the safe fetchhint.
When transitioning from one dialog to another, through either a <subdialog>, <goto>, <submit>, <link>, or <choice> element, there are additional rules that affect interpreter behavior. If the referenced URI names a document (e.g. "doc#dialog"), or if query data is provided (through POST or GET), then a new document is obtained (either from a local cache, an intermediate cache, or from an origin Web server). When it is obtained, the document goes through its initialization phase (i.e., obtaining and initializing a new application root document if needed, initializing document variables, and executing document scripts). The requested dialog (or first dialog if none is specified) is then initialized and execution of the dialog begins.
Generally, if a URI reference contains only a fragment (e.g., "#my_dialog"), then no document is fetched, and no initialization of that document is performed. However, <submit> always results in a fetch, and if a fragment is accompanied by a namelist attribute there will also be a fetch.
Another exception is when a URI reference in a leaf document references the application root document. In this case, the root document is transitioned to without fetching and without initialization even if the URI reference contains an absolute or relative URI (see Section 1.5.2 and [RFC2396]). However, if the URI reference to the root document contains a query string or a namelist attribute, the root document is fetched.
Elements that fetch VoiceXML documents also support the following additional attribute:
fetchaudio | The URI of the audio clip to play while the fetch is being done. If not specified, the fetchaudio property is used, and if that property is not set, no audio is played during the fetch. The fetching of the audio clip is governed by the audiofetchhint, audiomaxage, audiomaxstale, and fetchtimeout properties in effect at the time of the fetch. The playing of the audio clip is governed by the fetchaudiodelay and fetchaudiominimum properties in effect at the time of the fetch. |
---|
The fetchaudio attribute is useful for enhancing a user experience when there may be noticeable delays while the next document is retrieved. This can be used to play background music, or a series of announcements. When the document is retrieved, the audio file is interrupted if it is still playing. If an error occurs retrieving fetchaudio from its URI, no badfetch event is thrown and no audio is played during the fetch.
The VoiceXML interpreter context, like [HTML] visual browsers, can use caching to improve performance in fetching documents and other resources; audio recordings (which can be quite large) are as common to VoiceXML documents as images are to HTML pages. In a visual browser it is common to include end user controls to update or refresh content that is perceived to be stale. This is not the case for the VoiceXML interpreter context, since it lacks equivalent end user controls. Thus enforcement of cache refresh is at the discretion of the document through appropriate use of the maxage and maxstale attributes.
The caching policy used by the VoiceXML interpreter context must adhere to the cache correctness rules of HTTP 1.1 ([RFC2616]). In particular, the Expires and Cache-Control headers must be honored. The following algorithm summarizes these rules and represents the interpreter context behavior when requesting a resource:
The "maxstale check" is:
Note: it is an optimization to perform a "get if modified" on a document still present in the cache when the policy requires a fetch from the server.
The maxage and maxstale properties are allowed to have no default value whatsoever. If the value is not provided by the document author, and the platform does not provide a default value, then the value is undefined and the 'Otherwise' clause of the algorithm applies. All other properties must provide a default value (either as given by the specification or by the platform).
While the maxage and maxstale attributes are drawn from and directly supported by HTTP 1.1, some resources may be addressed by URIs that name protocols other than HTTP. If the protocol does not support the notion of resource age, the interpreter context shall compute a resource's age from the time it was received. If the protocol does not support the notion of resource staleness, the interpreter context shall consider the resource to have expired immediately upon receipt.
VoiceXML allows the author to override the default caching behavior for each use of each resource (except for any document referenced by the <vxml> element's application attribute: there is no markup mechanism to control the caching policy for an application root document).
Each resource-related element may specify maxage and maxstale attributes. Setting maxage to a non-zero value can be used to get a fresh copy of a resource that may not have yet expired in the cache. A fresh copy can be unconditionally requested by setting maxage to zero.
Using maxstale enables the author to state that an expired copy of a resource, that is not too stale (according to the rules of HTTP 1.1), may be used. This can improve performance by eliminating a fetch that would otherwise be required to get a fresh copy. It is especially useful for authors who may not have direct server-side control of the expiration dates of large static files.
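As an informal illustration (not part of the normative algorithm), the interaction of freshness, maxage, and maxstale can be sketched as follows; the function name and its second-resolution arguments are assumptions for the sketch:

```python
def use_cached_copy(age, freshness_lifetime, maxage=None, maxstale=None):
    """Illustrative sketch: decide whether a cached resource may be used
    without refetching from the server (all values in seconds)."""
    # maxage bounds the acceptable age; maxage=0 always forces a fetch.
    if maxage is not None and age > maxage:
        return False
    # Content still within its freshness lifetime may be used.
    if age <= freshness_lifetime:
        return True
    # Expired content may be used only if maxstale covers the excess.
    return maxstale is not None and (age - freshness_lifetime) <= maxstale

# A fresh copy can be unconditionally requested by setting maxage to zero.
assert use_cached_copy(10, 60, maxage=0) is False
# maxstale permits a copy that expired no more than 30 seconds ago.
assert use_cached_copy(80, 60, maxstale=30) is True
```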
Prefetching is an optional feature that an interpreter context may implement to obtain a resource before it is needed. A resource that may be prefetched is identified by an element whose fetchhint attribute equals "prefetch". When an interpreter context does prefetch a resource, it must ensure that the resource fetched is precisely the one needed. In particular, if the URI is computed with an expr attribute, the interpreter context must not move the fetch up before any assignments to the expression's variables. Likewise, the fetch for a <submit> must not be moved prior to any assignments of the namelist variables.
The expiration status of a resource must be checked on each use of the resource, and, if its fetchhint attribute is "prefetch", then it is prefetched. The check must follow the caching policy specified in Section 6.1.2.
The "http" URI scheme must be supported by VoiceXML platforms, the "https" protocol should be supported, and other URI protocols may be supported.
Metadata information is information about the document rather than the document's content. VoiceXML 2.0 provides two elements in which metadata information can be expressed: <meta> and <metadata>. The <metadata> element provides more general and powerful treatment of metadata information than <meta>.
VoiceXML does not specify required metadata information. However, it does recommend that metadata is expressed using the <metadata> element with information in Resource Description Framework (RDF) [RDF-SYNTAX] using the Dublin Core version 1.0 RDF schema [DC] (see Section 6.2.2).
The <meta> element specifies meta information as in [HTML]. There are two types of <meta>.
The first type specifies a metadata property of the document as a whole and is expressed by the pair of attributes, name and content. For example, to specify the maintainer of a VoiceXML document:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <meta name="maintainer" content="jpdoe@anycompany.example.com"/>
 <form>
  <block>
   <prompt>Hello</prompt>
  </block>
 </form>
</vxml>
The second type of <meta> specifies HTTP response headers and is expressed by the pair of attributes http-equiv and content. In the following example, the first <meta> element sets an expiration date that prevents caching of the document; the second <meta> element sets the Date header.
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <meta http-equiv="Expires" content="0"/>
 <meta http-equiv="Date" content="Thu, 12 Dec 2000 23:27:21 GMT"/>
 <form>
  <block>
   <prompt>Hello</prompt>
  </block>
 </form>
</vxml>
Attributes of <meta> are:
name | The name of the metadata property. |
---|---|
content | The value of the metadata property. |
http-equiv | The name of an HTTP response header. |
Exactly one of "name" or "http-equiv" must be specified; otherwise, an error.badfetch event is thrown.
The <metadata> element is a container in which information about the document can be placed using a metadata schema. Although any metadata schema can be used with <metadata>, it is recommended that the RDF schema is used in conjunction with metadata properties defined in the Dublin Core Metadata Initiative.
RDF is a declarative language and provides a standard way for using XML to represent metadata in the form of statements about properties and relationships of items on the Web. Content creators should refer to W3C metadata Recommendations [RDF-SYNTAX] and [RDF-SCHEMA] as well as the Dublin Core Metadata Initiative [DC], which is a set of generally applicable core metadata properties (e.g., Title, Creator, Subject, Description, Copyrights, etc.).
The following Dublin Core metadata properties are recommended in <metadata>:
Creator | An entity primarily responsible for making the content of the resource. |
---|---|
Rights | Information about rights held in and over the resource. |
Subject | The topic of the content of the resource. Typically, a subject will be expressed as keywords, key phrases or classification codes. Recommended best practice is to select values from a controlled vocabulary or formal classification scheme. |
Here is an example of how <metadata> can be included in a VoiceXML document using the Dublin Core version 1.0 RDF schema [DC]:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <metadata>
  <rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/TR/1999/PR-rdf-schema-19990303#"
    xmlns:dc="http://purl.org/metadata/dublin_core#">
   <!-- Metadata about the VoiceXML document -->
   <rdf:Description
     about="http://www.example.com/meta.vxml"
     dc:Title="Directory Enquiry Service"
     dc:Description="Directory Enquiry Service for London in VoiceXML"
     dc:Publisher="W3C"
     dc:Language="en"
     dc:Date="2002-02-12"
     dc:Rights="Copyright 2002 John Smith"
     dc:Format="application/voicexml+xml">
    <dc:Creator>
     <rdf:Seq>
      <rdf:li>Jackie Crystal</rdf:li>
      <rdf:li>William Lee</rdf:li>
     </rdf:Seq>
    </dc:Creator>
   </rdf:Description>
  </rdf:RDF>
 </metadata>
 <form>
  <block>
   <prompt>Hello</prompt>
  </block>
 </form>
</vxml>
The <property> element sets a property value. Properties are used to set values that affect platform behavior, such as the recognition process, timeouts, caching policy, etc.
Properties may be defined for the whole application, for the whole document at the <vxml> level, for a particular dialog at the <form> or <menu> level, or for a particular form item. Properties apply to their parent element and all the descendants of the parent. A property at a lower level overrides a property at a higher level. When different values for a property are specified at the same level, the last one in document order applies. Properties specified in the application root document provide default values for properties in every document in the application; properties specified in an individual document override property values specified in the application root document.
If a platform detects that the value of a property is invalid, then it should throw an error.semantic.
In some cases, <property> elements specify default values for element attributes, such as timeout or bargein. For example, to turn off bargein by default for all the prompts in a particular form:
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form>
  <property name="bargein" value="false"/>
  <block>
   <prompt>
    This introductory prompt cannot be barged into.
   </prompt>
   <prompt>
    And neither can this prompt.
   </prompt>
   <prompt bargein="true">
    But this one <emphasis>can</emphasis> be barged into.
   </prompt>
  </block>
  <field type="boolean">
   <prompt>
    Please say yes or no.
   </prompt>
  </field>
 </form>
</vxml>
The <property> element has the following attributes:
name | The name of the property. |
---|---|
value | The value of the property. |
An interpreter context is free to provide platform-specific properties. For example, to set the "multiplication factor" for this platform in the scope of this document:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form>
  <property name="com.example.multiplication_factor" value="42"/>
  <block>
   <prompt> Welcome </prompt>
  </block>
 </form>
</vxml>
By definition, platform-specific properties introduce incompatibilities which reduce application portability. To minimize them, the following interpreter context guidelines are strongly recommended:
Platform-specific properties should use reverse domain names to eliminate potential collisions, as in com.example.foo, which is clearly different from net.example.foo.
An interpreter context must not throw an error.unsupported.property event when encountering a property it cannot process; rather, the interpreter context must just ignore that property.
The generic speech recognizer properties are mostly taken from the Java Speech API [JSAPI]:
Property | Description
---|---
confidencelevel | The speech recognition confidence level, a float value in the range of 0.0 to 1.0. Results are rejected (a nomatch event is thrown) when application.lastresult$.confidence is below this threshold. A value of 0.0 means minimum confidence is needed for a recognition, and a value of 1.0 requires maximum confidence. The value is a Real Number Designation (see Section 6.5). The default value is 0.5.
sensitivity | Set the sensitivity level. A value of 1.0 means that it is highly sensitive to quiet input. A value of 0.0 means it is least sensitive to noise. The value is a Real Number Designation (see Section 6.5). The default value is 0.5.
speedvsaccuracy | A hint specifying the desired balance between speed and accuracy. A value of 0.0 means fastest recognition. A value of 1.0 means best accuracy. The value is a Real Number Designation (see Section 6.5). The default value is 0.5.
completetimeout | The length of silence required following user speech before the speech recognizer finalizes a result (either accepting it or throwing a nomatch event). The complete timeout is used when the speech is a complete match of an active grammar. By contrast, the incomplete timeout is used when the speech is an incomplete match to an active grammar. A long complete timeout value delays the result completion and therefore makes the computer's response slow. A short complete timeout may lead to an utterance being broken up inappropriately. Reasonable complete timeout values are typically in the range of 0.3 seconds to 1.0 seconds. The value is a Time Designation (see Section 6.5). The default is platform-dependent. See Appendix D. Although platforms must parse the completetimeout property, platforms are not required to support the behavior of completetimeout. Platforms choosing not to support the behavior of completetimeout must so document and adjust the behavior of the incompletetimeout property as described below.
incompletetimeout | The required length of silence following user speech after which a recognizer finalizes a result. The incomplete timeout applies when the speech prior to the silence is an incomplete match of all active grammars. In this case, once the timeout is triggered, the partial result is rejected (with a nomatch event). The incomplete timeout also applies when the speech prior to the silence is a complete match of an active grammar, but where it is possible to speak further and still match the grammar. By contrast, the complete timeout is used when the speech is a complete match to an active grammar and no further words can be spoken. A long incomplete timeout value delays the result completion and therefore makes the computer's response slow. A short incomplete timeout may lead to an utterance being broken up inappropriately. The incomplete timeout is usually longer than the complete timeout to allow users to pause mid-utterance (for example, to breathe). See Appendix D. Platforms choosing not to support the completetimeout property (described above) must use the maximum of the completetimeout and incompletetimeout values as the value for the incompletetimeout. The value is a Time Designation (see Section 6.5).
maxspeechtimeout | The maximum duration of user speech. If this time elapses before the user stops speaking, the event "maxspeechtimeout" is thrown. The value is a Time Designation (see Section 6.5). The default duration is platform-dependent.
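As an illustration of how these recognizer timing properties might be combined, the sketch below lengthens the incomplete timeout so callers can pause mid-utterance while keeping the complete timeout short. The property values and the grammar URI (account.grxml) are illustrative assumptions, not recommended defaults:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form>
  <!-- Illustrative values only; reasonable settings are platform-dependent. -->
  <property name="completetimeout" value="0.5s"/>
  <property name="incompletetimeout" value="1.5s"/>
  <property name="maxspeechtimeout" value="30s"/>
  <field name="account">
   <grammar src="account.grxml" type="application/srgs+xml"/>
   <prompt> Please say your account number. </prompt>
  </field>
 </form>
</vxml>
```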
Several generic properties pertain to DTMF grammar recognition:
Property | Description
---|---
interdigittimeout | The inter-digit timeout value to use when recognizing DTMF input. The value is a Time Designation (see Section 6.5). The default is platform-dependent. See Appendix D.
termtimeout | The terminating timeout to use when recognizing DTMF input. The value is a Time Designation (see Section 6.5). The default value is "0s". See Appendix D.
termchar | The terminating DTMF character for DTMF input recognition. The default value is "#". See Appendix D.
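For example, a form collecting a PIN might combine these DTMF properties as sketched below; the property values are illustrative, and the field uses the built-in digits grammar:

```xml
<form>
 <!-- Illustrative values: end input with "#", or finalize after
      3 seconds of inter-digit silence. -->
 <property name="interdigittimeout" value="3s"/>
 <property name="termchar" value="#"/>
 <property name="termtimeout" value="0s"/>
 <field name="pin" type="digits">
  <prompt> Please enter your PIN, followed by the pound key. </prompt>
 </field>
</form>
```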
These properties apply to the fundamental platform prompt and collect cycle:
Property | Description
---|---
bargein | The bargein attribute to use for prompts. Setting this to true allows bargein by default. Setting it to false disallows bargein. The default value is "true".
bargeintype | Sets the type of bargein to be speech or hotword. Default is platform-specific. See Section 4.1.5.1.
timeout | The time after which a noinput event is thrown by the platform. The value is a Time Designation (see Section 6.5). The default value is platform-dependent. See Appendix D.
These properties pertain to the fetching of new documents and resources (note that maxage and maxstale properties may have no default value - see Section 6.1.2):
Property | Description
---|---
audiofetchhint | This tells the platform whether or not it can attempt to optimize dialog interpretation by pre-fetching audio. The value is either safe to say that audio is only fetched when it is needed, never before; or prefetch to permit, but not require the platform to pre-fetch the audio. The default value is prefetch.
audiomaxage | Tells the platform the maximum acceptable age, in seconds, of cached audio resources. The default is platform-specific.
audiomaxstale | Tells the platform the maximum acceptable staleness, in seconds, of expired cached audio resources. The default is platform-specific.
documentfetchhint | Tells the platform whether or not documents may be pre-fetched. The value is either safe (the default), or prefetch.
documentmaxage | Tells the platform the maximum acceptable age, in seconds, of cached documents. The default is platform-specific.
documentmaxstale | Tells the platform the maximum acceptable staleness, in seconds, of expired cached documents. The default is platform-specific.
grammarfetchhint | Tells the platform whether or not grammars may be pre-fetched. The value is either prefetch (the default), or safe.
grammarmaxage | Tells the platform the maximum acceptable age, in seconds, of cached grammars. The default is platform-specific.
grammarmaxstale | Tells the platform the maximum acceptable staleness, in seconds, of expired cached grammars. The default is platform-specific.
objectfetchhint | Tells the platform whether the URI contents for <object> may be pre-fetched or not. The values are prefetch (the default), or safe.
objectmaxage | Tells the platform the maximum acceptable age, in seconds, of cached objects. The default is platform-specific.
objectmaxstale | Tells the platform the maximum acceptable staleness, in seconds, of expired cached objects. The default is platform-specific.
scriptfetchhint | Tells whether scripts may be pre-fetched or not. The values are prefetch (the default), or safe.
scriptmaxage | Tells the platform the maximum acceptable age, in seconds, of cached scripts. The default is platform-specific.
scriptmaxstale | Tells the platform the maximum acceptable staleness, in seconds, of expired cached scripts. The default is platform-specific.
fetchaudio | The URI of the audio to play while waiting for a document to be fetched. The default is not to play any audio during fetch delays. There are no fetchaudio properties for audio, grammars, objects, and scripts. The fetching of the audio clip is governed by the audiofetchhint, audiomaxage, audiomaxstale, and fetchtimeout properties in effect at the time of the fetch. The playing of the audio clip is governed by the fetchaudiodelay and fetchaudiominimum properties in effect at the time of the fetch.
fetchaudiodelay | The time interval to wait at the start of a fetch delay before playing the fetchaudio source. The value is a Time Designation (see Section 6.5). The default interval is platform-dependent, e.g. "2s". The idea is that when a fetch delay is short, it may be better to have a few seconds of silence instead of a bit of fetchaudio that is immediately cut off.
fetchaudiominimum | The minimum time interval to play a fetchaudio source, once started, even if the fetch result arrives in the meantime. The value is a Time Designation (see Section 6.5). The default is platform-dependent, e.g., "5s". The idea is that once the user does begin to hear fetchaudio, it should not be stopped too quickly.
fetchtimeout | The timeout for fetches. The value is a Time Designation (see Section 6.5). The default value is platform-dependent.
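The fetch-related properties can be combined; for instance, the sketch below plays hold music during slow document fetches while bounding the fetch time. The URIs and values here are hypothetical:

```xml
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
 <!-- Hypothetical URIs and illustrative values. -->
 <property name="fetchtimeout" value="10s"/>
 <property name="fetchaudio" value="http://www.example.com/audio/hold_music.wav"/>
 <property name="fetchaudiodelay" value="2s"/>
 <property name="fetchaudiominimum" value="5s"/>
 <form>
  <block>
   <goto next="http://www.example.com/slow_document.vxml"/>
  </block>
 </form>
</vxml>
```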
Property | Description
---|---
inputmodes | This property determines which input modality to use. The input modes to enable: dtmf and voice. On platforms that support both modes, inputmodes defaults to "dtmf voice". To disable speech recognition, set inputmodes to "dtmf". To disable DTMF, set it to "voice". One use for this would be to turn off speech recognition in noisy environments. Another would be to conserve speech recognition resources by turning them off where the input is always expected to be DTMF. This property does not control the activation of grammars. For instance, voice-only grammars may be active when the inputmode is restricted to DTMF. Those grammars would not be matched, however, because the voice input modality is not active.
universals | Platforms may optionally provide platform-specific universal command grammars, such as "help", "cancel", or "exit" grammars, that are always active (except in the case of modal input items - see Section 3.1.4) and which generate specific events. Production-grade applications often need to define their own universal command grammars, e.g., to increase application portability or to provide a distinctive interface. They specify new universal command grammars with <link> elements. They turn off the default grammars with this property. Default catch handlers are not affected by this property. The value "none" is the default, and means that all platform default universal command grammars are disabled. The value "all" turns them all on. Individual grammars are enabled by listing their names separated by spaces; for example, "cancel exit help".
maxnbest | This property controls the maximum size of the "application.lastresult$" array; the array is constrained to be no larger than the value specified by 'maxnbest'. This property has a minimum value of 1. The default value is 1.
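A document that expects only DTMF input, re-enables two of the platform's universal command grammars, and asks for up to three recognition hypotheses might, as a sketch, set:

```xml
<property name="inputmodes" value="dtmf"/>
<property name="universals" value="cancel help"/>
<property name="maxnbest" value="3"/>
```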
Our last example shows several of these properties used at multiple levels.
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <!-- set default characteristics for page -->
 <property name="audiofetchhint" value="safe"/>
 <property name="confidencelevel" value="0.75"/>
 <form>
  <!-- override defaults for this form only -->
  <property name="confidencelevel" value="0.5"/>
  <property name="bargein" value="false"/>
  <grammar src="address_book.grxml" type="application/srgs+xml"/>
  <block>
   <prompt> Welcome to the Voice Address Book </prompt>
  </block>
  <initial name="start">
   <!-- override default timeout value -->
   <property name="timeout" value="5s"/>
   <prompt> Who would you like to call? </prompt>
  </initial>
  <field name="person">
   <prompt> Say the name of the person you would like to call. </prompt>
  </field>
  <field name="location">
   <prompt> Say the location of the person you would like to call. </prompt>
  </field>
  <field name="confirm">
   <grammar type="application/srgs+xml" src="/grammars/boolean.grxml"/>
   <!-- Use actual utterances to playback recognized words,
        rather than returned slot values -->
   <prompt>
    You said to call <value expr="person$.utterance"/>
    at <value expr="location$.utterance"/>.
    Is this correct?
   </prompt>
   <filled>
    <if cond="confirm">
     <submit namelist="person location"
       next="http://www.messagecentral.example.com/voice/make_call"/>
    </if>
    <clear/>
   </filled>
  </field>
 </form>
</vxml>
The <param> element is used to specify values that are passed to subdialogs or objects. It is modeled on the [HTML] <PARAM> element. Its attributes are:
Attribute | Description
---|---
name | The name to be associated with this parameter when the object or subdialog is invoked.
expr | An expression that computes the value associated with name.
value | Associates a literal string value with name.
valuetype | One of data or ref, by default data; used to indicate to an object whether the value associated with name is data or a URI (ref). This is not used for <subdialog> since values are always data.
type | The media type of the result provided by a URI if the valuetype is ref; only relevant for uses of <param> in <object>.
Exactly one of "expr" or "value" must be specified; otherwise, an error.badfetch event is thrown.
The use of valuetype and type is optional in general, although they may be required by specific objects. When <param> is contained in a <subdialog> element, the values specified by it are used to initialize dialog <var> elements in the subdialog that is invoked. See Section 2.3.4 for details regarding initialization of variables in subdialogs using <param>. When <param> is contained in an <object>, the use of the parameter data is specific to the object that is being invoked, and is outside the scope of the VoiceXML specification.
Below is an example of <param> used as part of an <object>. In this case, the first two <param> elements have expressions (implicitly of valuetype="data"), the third <param> has an explicit value, and the fourth is a URI that returns a media type of text/plain. The meaning of this data is specific to the object.
<object name="debit"
  classid="method://credit-card/gather_and_debit"
  data="http://www.recordings.example.com/prompts/credit/jesse.jar">
 <param name="amount" expr="document.amt"/>
 <param name="vendor" expr="vendor_num"/>
 <param name="application_id" value="ADC5678-QWOO"/>
 <param name="authentication_server"
   value="http://auth-svr.example.com"
   valuetype="ref"
   type="text/plain"/>
</object>
The next example illustrates <param> used with <subdialog>. In this case, two expressions are used to initialize variables in the scope of the subdialog form.
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form>
  <subdialog name="result" src="http://another.example.com/#getssn">
   <param name="firstname" expr="document.first"/>
   <param name="lastname" expr="document.last"/>
   <filled>
    <submit namelist="result.ssn"
      next="http://myservice.example.com/cgi-bin/process"/>
   </filled>
  </subdialog>
 </form>
</vxml>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form id="getssn">
  <var name="firstname"/>
  <var name="lastname"/>
  <var name="status"/>
  <field name="ssn">
   <grammar src="http://grammarlib/ssn.grxml" type="application/srgs+xml"/>
   <prompt> Please say Social Security number. </prompt>
   <filled>
    <if cond="validssn(firstname,lastname,ssn)">
     <assign name="status" expr="true"/>
     <return namelist="status ssn"/>
    <else/>
     <assign name="status" expr="false"/>
     <return namelist="status"/>
    </if>
   </filled>
  </field>
 </form>
</vxml>
Using <param> in a <subdialog> is a convenient way of passing data to a subdialog without requiring the use of server-side scripting.
Several VoiceXML parameter values follow the conventions used in the W3C's Cascading Style Sheets Recommendation [CSS2].
Real numbers and integers are specified in decimal notation only. An integer consists of one or more digits "0" to "9". A real number may be an integer, or it may be zero or more digits followed by a dot (.) followed by one or more digits. Both integers and real numbers may be preceded by a "-" or "+" to indicate the sign.
Time designations consist of a non-negative real number followed by a time unit identifier. The time unit identifiers are:
ms: milliseconds
s: seconds
Examples include: "3s", "850ms", "0.7s", ".5s" and "+1.5s".
The VoiceXML DTD is located at http://www.w3.org/TR/voicexml20/vxml.dtd.
Due to DTD limitations, the VoiceXML DTD does not correctly express that the <metadata> element can contain elements from other XML namespaces.
Note: the VoiceXML DTD includes modified elements from the DTDs of the Speech Recognition Grammar Specification 1.0 [SRGS] and the Speech Synthesis Markup Language 1.0 [SSML].
The form interpretation algorithm (FIA) drives the interaction between the user and a VoiceXML form or menu. A menu can be viewed as a form containing a single field whose grammar and whose <filled> action are constructed from the <choice> elements.
The FIA must handle:
Form initialization.
Prompting, including the management of the prompt counters needed for prompt tapering.
Grammar activation and deactivation at the form and form item levels.
Entering the form with an utterance that matched one of the form's document-scoped grammars while the user was visiting a different form or menu.
Leaving the form because the user matched another form, menu, or link's document-scoped grammar.
Processing multiple field fills from one utterance, including the execution of the relevant <filled> actions.
Selecting the next form item to visit, and then processing that form item.
Choosing the correct catch element to handle any events thrown while processing a form item.
First we define some terms and data structures used in the form interpretation algorithm:
Here is the conceptual form interpretation algorithm. The FIA can start with no initial utterance, or with an initial utterance passed in from another dialog:
//
// Initialization Phase
//
foreach ( <var>, <script> and form item, in document order )
  if ( the element is a <var> )
    Declare the variable, initializing it to the value of the
    "expr" attribute, if any, or else to undefined.
  else if ( the element is a <script> )
    Evaluate the contents of the script if inlined or else from
    the location specified by the "src" attribute.
  else if ( the element is a form item )
    Create a variable from the "name" attribute, if any, or else
    generate an internal name.
    Assign to this variable the value of the "expr" attribute,
    if any, or else undefined.
foreach ( input item and <initial> element )
  Declare a prompt counter and set it to 1.

if ( user entered this form by speaking to its grammar while in
     a different form )
{
  Enter the main loop below, but start in the process phase,
  not the select phase: we already have a collection to process.
}

//
// Main Loop: select next form item and execute it.
//
while ( true )
{
  //
  // Select Phase: choose a form item to visit.
  //
  if ( the last main loop iteration ended with a <goto nextitem> )
    Select that next form item.
  else if ( there is a form item with an unsatisfied guard condition )
    Select the first such form item in document order.
  else
    Do an <exit/> -- the form is full and specified no transition.

  //
  // Collect Phase: execute the selected form item.
  //
  // Queue up prompts for the form item.
  unless ( the last loop iteration ended with a catch that had no
           <reprompt>, and the active dialog was not changed )
  {
    Select the appropriate prompts for an input item or <initial>.
    Queue the selected prompts for play prior to the next collect
    operation.
    Increment an input item's or <initial>'s prompt counter.
  }

  // Activate grammars for the form item.
  if ( the form item is modal )
    Set the active grammar set to the form item grammars, if any.
    (Note that some form items, e.g. <block>, cannot have any
    grammars).
  else
    Set the active grammar set to the form item grammars and any
    grammars scoped to the form, the current document, and the
    application root document.

  // Execute the form item.
  if ( a <field> was selected )
    Collect an utterance or an event from the user.
  else if ( a <record> was chosen )
    Collect an utterance (with a name/value pair for the recorded
    bytes) or event from the user.
  else if ( an <object> was chosen )
    Execute the object, setting the <object>'s form item variable
    to the returned ECMAScript value.
  else if ( a <subdialog> was chosen )
    Execute the subdialog, setting the <subdialog>'s form item
    variable to the returned ECMAScript value.
  else if ( a <transfer> was chosen )
    Do the transfer, and (if wait is true) set the <transfer>
    form item variable to the returned result status indicator.
  else if ( an <initial> was chosen )
    Collect an utterance or an event from the user.
  else if ( a <block> was chosen )
  {
    Set the block's form item variable to a defined value.
    Execute the block's executable context.
  }

  //
  // Process Phase: process the resulting utterance or event.
  //

  // Assign the utterance and other information about the last
  // recognition to application.lastresult$.

  // Must have an utterance
  if ( the utterance matched a grammar belonging to a <link> )
    If the link specifies a "next" or "expr" attribute,
    transition to that location.
    Else if the link specifies an "event" or "eventexpr"
    attribute, generate that event.
  else if ( the utterance matched a grammar belonging to a <choice> )
    If the choice specifies a "next" or "expr" attribute,
    transition to that location.
    Else if the choice specifies an "event" or "eventexpr"
    attribute, generate that event.
  else if ( the utterance matched a grammar from outside the
            current <form> or <menu> )
  {
    Transition to that <form> or <menu>, carrying the utterance
    to the new FIA.
  }

  // Process an utterance spoken to a grammar from this form.

  // First copy utterance result property values into
  // corresponding form item variables.
  Clear all "just_filled" flags.
  if ( the grammar is scoped to the field-level )
  {
    // This grammar must be enclosed in an input item. The input
    // item has an associated ECMAScript variable (referred to
    // here as the input item variable) and slot name.
    if ( the result is not a structure )
      Copy the result into the input item variable.
    else if ( a top-level property in the result matches the slot
              name or the slot name is a dot-separated path
              matching a subproperty in the result )
      Copy the value of that property into the input item variable.
    else
      Copy the entire result into the input item variable.
    Set this input item's "just_filled" flag.
  }
  else
  {
    foreach ( property in the user's utterance )
    {
      if ( the property matches an input item's slot name )
      {
        Copy the value of that property into the input item's
        form item variable.
        Set the input item's "just_filled" flag.
      }
    }
  }

  // Set all <initial> form item variables if any input items
  // are filled.
  if ( any input item variable is set as a result of the user
       utterance )
    Set all <initial> form item variables to true.

  // Next execute any triggered <filled> actions.
  foreach ( <filled> action in document order )
  {
    // Determine the input item variables the <filled> applies to.
    N = the <filled>'s "namelist" attribute.
    if ( N equals "" )
    {
      if ( the <filled> is a child of an input item )
        N = the input item's form item variable name.
      else if ( the <filled> is a child of a form )
        N = the form item variable names of all the input items
        in that form.
    }

    // Is the <filled> triggered?
    if ( any input item variable in the set N was "just_filled"
         AND ( the <filled> mode is "all" AND all variables in N
               are filled
               OR the <filled> mode is "any" AND any variables
               in N are filled ) )
      Execute the <filled> action.
      If an event is thrown during the execution of a <filled>,
      event handler selection starts in the scope of the
      <filled>, which could be an input item or the form itself.
  }

  // If no input item is filled, just continue.
}
During FIA execution, events may be generated at several points. These events are processed differently depending on which phase is active.
Before a form item is selected (i.e. during the Initialization and Select phases), events are generated at the dialog level. The corresponding catch handler is located and executed. If the catch does not result in a transition from the current dialog, FIA execution will terminate.
Similarly, events triggered after a form item is selected (i.e. during the Collect and Process phases) are usually generated at the form item level. There is one exception: events triggered by a dialog-level <filled> are generated at the dialog level. The corresponding catch handler is located and executed. If the catch does not result in a transition, the current FIA loop is terminated and the Select phase is reentered.
The various timing properties for speech and DTMF recognition work together to define the user experience. The ways in which these different timing parameters function are outlined in the timing diagrams below. In these diagrams, the wait for DTMF input or user speech begins at the time the last prompt has finished playing.
DTMF grammars use timeout, interdigittimeout, termtimeout and termchar as described in Section 6.3.3 to tailor the user experience. The effects of these are shown in the following timing diagrams.
The timeout parameter determines when the <noinput> event is thrown because the user has failed to enter any DTMF (Figure 12). Once the first DTMF has been entered, this parameter has no further effect.
Figure 12: Timing diagram for timeout when no input provided.
In Figure 13, the interdigittimeout determines when the nomatch event is thrown because a DTMF grammar is not yet recognized, and the user has failed to enter additional DTMF.
Figure 13: Timing diagram for interdigittimeout, grammar is not ready to terminate.
The example below shows the situation when a DTMF grammar could terminate, or be extended by the addition of more DTMF input, and the user has elected not to provide any further input.
Figure 14: Timing diagram for interdigittimeout, grammar is ready to terminate.
In the example below, a termchar is non-empty, and is entered by the user before an interdigittimeout expires, to signify that the user's DTMF input is complete; the termchar is not included as part of the recognized value.
Figure 15: Timing diagram for termchar and interdigittimeout, grammar can terminate.
In the example below, the entry of the last DTMF has brought the grammar to a termination point at which no additional DTMF is expected. Since termchar is empty, there is no optional terminating character permitted, thus the recognition ends and the recognized value is returned.
Figure 16: Timing diagram for termchar empty when grammar must terminate.
In the example below, the entry of the last DTMF has brought the grammar to a termination point at which no additional DTMF is allowed by the grammar. If the termchar is non-empty, then the user can enter an optional termchar DTMF. If the user fails to enter this optional DTMF within termtimeout, the recognition ends and the recognized value is returned. If the termtimeout is 0s (the default), then the recognized value is returned immediately after the last DTMF allowed by the grammar, without waiting for the optional termchar. Note: the termtimeout applies only when no additional input is allowed by the grammar; otherwise, the interdigittimeout applies.
Figure 17: Timing diagram for termchar non-empty and termtimeout when grammar must terminate.
In this example, the entry of the last DTMF has brought the grammar to a termination point at which no additional DTMF is allowed by the grammar. Since the termchar is non-empty, the user enters the optional termchar within termtimeout, causing the recognized value to be returned (excluding the termchar).
Figure 18: Timing diagram for termchar non-empty when grammar must terminate.
While waiting for the first or additional DTMF, three different timeouts may determine when the user's input is considered complete. If no DTMF has been entered, the timeout applies; if some DTMF has been entered but additional DTMF is valid, then the interdigittimeout applies; and if no additional DTMF is legal, then the termtimeout applies. At each point, the user may enter DTMF which is not permitted by the active grammar(s). This causes the collected DTMF string to be invalid. Additional digits will be collected until either the termchar is pressed or the interdigittimeout has elapsed. A nomatch event is then generated.
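To make the three DTMF timeouts concrete, a field collecting a four-digit code might be sketched as follows (the values are illustrative, and the field uses the built-in digits grammar): timeout applies before the first digit, interdigittimeout between digits, and termtimeout after the fourth digit while waiting for the optional "#".

```xml
<field name="code" type="digits?length=4">
 <!-- Illustrative values only. -->
 <property name="timeout" value="10s"/>
 <property name="interdigittimeout" value="3s"/>
 <property name="termtimeout" value="2s"/>
 <property name="termchar" value="#"/>
 <prompt> Please enter your four digit code. </prompt>
</field>
```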
Speech grammars use timeout, completetimeout, and incompletetimeout as described in Section 6.3.4 and Section 6.3.2 to tailor the user experience. The effects of these are shown in the following timing diagrams.
In the example below, the timeout parameter determines when the noinput event is thrown because the user has failed to speak.
Figure 19: Timing diagram for timeout when no speech provided.
In the example below, the user provided an utterance that was recognized by the speech grammar. After a silence period of completetimeout has elapsed, the recognized value is returned.
Figure 20: Timing diagram for completetimeout with speech grammar recognized.
In the example below, the user provided an utterance that is not yet recognized by the speech grammar but is the prefix of a legal utterance. After a silence period of incompletetimeout has elapsed, a nomatch event is thrown.
Figure 21: Timing diagram for incompletetimeout with speech grammar unrecognized.
VoiceXML requires that a platform support playing and recording the audio formats specified below.
Audio Format | Media Type |
---|---|
Raw (headerless) 8kHz 8-bit mono mu-law [PCM] single channel. (G.711) | audio/basic (from [RFC1521]) |
Raw (headerless) 8kHz 8-bit mono A-law [PCM] single channel. (G.711) | audio/x-alaw-basic |
WAV (RIFF header) 8kHz 8-bit mono mu-law [PCM] single channel. | audio/x-wav |
WAV (RIFF header) 8kHz 8-bit mono A-law [PCM] single channel. | audio/x-wav |
The 'audio/basic' MIME type is commonly used with the 'au' header format as well as the headerless 8-bit 8kHz mu-law format. If this MIME type is specified for recording, the mu-law format must be used. For playback with the 'audio/basic' MIME type, platforms must support the mu-law format and may support the 'au' format.
This section is Normative.
A conforming VoiceXML document is a well-formed [XML] document that requires only the facilities described as mandatory in this specification. Such a document must meet all of the following criteria:
The document must conform to the constraints expressed in the VoiceXML Schema (Appendix O).
The root element of the document must be <vxml>.
The <vxml> element must include a "version" attribute with the value "2.0".
The <vxml> element must designate the VoiceXML namespace. This can be achieved by declaring an "xmlns" attribute or an attribute with an "xmlns" prefix [XMLNAMES]. The namespace for VoiceXML is defined to be http://www.w3.org/2001/vxml.
It is recommended that the <vxml> element also indicate the location of the VoiceXML schema (see Appendix O) via the xsi:schemaLocation attribute from [SCHEMA1]:
xsi:schemaLocation="http://www.w3.org/2001/vxml http://www.w3.org/TR/voicexml20/vxml.xsd"
Although such indication is not required, this document provides it on all of the examples to encourage its use.
There may be a DOCTYPE declaration in the document prior to the root element. If present, the public identifier included in the DOCTYPE declaration must reference the VoiceXML DTD (Appendix B) using its Formal Public Identifier.
<!DOCTYPE vxml PUBLIC "-//W3C//DTD VOICEXML 2.0//EN" "http://www.w3.org/TR/voicexml20/vxml.dtd">
The system identifier may be modified appropriately.
The DTD subset must not be used to override any parameter entities in the DTD.
Here is an example of a Conforming VoiceXML document:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
 <form>
  <block>hello</block>
 </form>
</vxml>
Note that in this example, the recommended "xmlns:xsi" and "xsi:schemaLocation" attributes are included, as is an XML declaration. An XML declaration like the one above is not required in all XML documents. VoiceXML document authors are strongly encouraged to use XML declarations in all their documents. Such a declaration is required when the character encoding of the document is other than the default UTF-8 or UTF-16 and no encoding was determined by a higher-level protocol.
Neither the VoiceXML language nor these conformance criteria designate size limits on any aspect of VoiceXML documents. There are no maximum values on the number of elements, the amount of character data, or the number of characters in attribute values.
The VoiceXML namespace may be used with other XML namespaces as per [XMLNAMES], although such documents are not strictly conforming VoiceXML documents as defined above. Future work by W3C will address ways to specify conformance for documents involving multiple namespaces.
A VoiceXML processor is a user agent that can parse and process Conforming VoiceXML documents.
In a Conforming VoiceXML Processor, the XML parser must be able to parse and process all well-formed XML constructs defined within [XML] and [XMLNAMES]. It is not required that a Conforming VoiceXML processor use a validating parser.
A Conforming VoiceXML Processor must be a Conforming Speech Synthesis Markup Language Processor [SSML] and a Conforming XML Grammar Processor [SRGS] except for differences described in this document. If a syntax error is detected processing a grammar document, then an "error.badfetch" event must be thrown.
A Conforming VoiceXML Processor must support the syntax and semantics of all VoiceXML elements as described in this document. Consequently, a Conforming VoiceXML Processor must not throw an 'error.unsupported.<element>' for any VoiceXML element which must be supported when processing a Conforming VoiceXML Document.
When a Conforming VoiceXML Processor encounters a Conforming VoiceXML Document with non-VoiceXML elements or attributes which are proprietary, defined only in earlier versions of VoiceXML, or defined in a non-VoiceXML namespace, and which cannot be processed, then it must throw an "error.badfetch" event.
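An application can intercept such a platform-thrown "error.badfetch" with a <catch> handler at the dialog, document, or application root level. A minimal sketch (the prompt wording is illustrative, not normative):

```xml
<catch event="error.badfetch">
  <!-- Played when a document, grammar, or other resource
       cannot be fetched or processed -->
  <prompt>Sorry, there was a problem processing the application.</prompt>
  <exit/>
</catch>
```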
When a Conforming VoiceXML Processor encounters a document with a root element designating a namespace other than VoiceXML, its behavior is undefined.
There is, however, no conformance requirement with respect to performance characteristics of the VoiceXML Processor.
VoiceXML is an application of [XML] and thus supports [UNICODE], which defines a standard universal character set.
Additionally, VoiceXML provides a mechanism for precise control of the input and output languages via the "xml:lang" attribute, which may be set on the root element and overridden on individual elements.
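For example, a document might declare its default language on the root element and override it for an individual prompt; the language tags below are illustrative:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form>
    <block>
      <!-- Inherits en-US from the root element -->
      <prompt>Welcome.</prompt>
      <!-- Overrides the document default for this prompt only -->
      <prompt xml:lang="fr-FR">Bienvenue.</prompt>
    </block>
  </form>
</vxml>
```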
Voice is central to, but not the limit of, VoiceXML applications. While speaking and listening will be the most widely used techniques in most circumstances and for most users to interact with VoiceXML applications, some users may be unable to speak and/or listen because of temporary (or permanent) circumstances. Persons with disabilities, particularly those with speech and/or hearing impairments, may need to interact with VoiceXML applications in other ways:
<audio src="greetings.wav">Greetings</audio> would normally play the greetings.wav audio file. However, if the VoiceXML interpreter context has detected that the user is viewing the interaction on a display or is touching Braille output, then the text "Greetings" is rendered by the display or Braille output device.
Providing alternative paths to information delivery and user input is central to all W3C technologies intended for use by people. While initially authored to make on-screen content accessible, the following accessibility guidelines published by W3C's Web Accessibility Initiative (WAI) also apply to VoiceXML.
Additional guidelines for enabling persons with disabilities to access VoiceXML applications include the following:
A future version of VoiceXML may specify criteria by which a VoiceXML Processor safeguards the privacy of personal data.
The following is a summary of the differences between VoiceXML 2.0 and VoiceXML 1.0 [VOICEXML-1.0].
Developers of VoiceXML 1.0 applications should pay particular attention to the changes incompatible with VoiceXML 1.0 specified in Obsolete Elements and Incompatibly Modified Elements.
Definition: A packaged application fragment designed to be invoked by arbitrary applications or other Reusable Dialog Components. A Reusable Dialog Component (RDC) encapsulates the code for an interaction with the caller.
Reusable dialog components provide pre-packaged functionality "out-of-the-box" that enables developers to quickly build applications by providing standard default settings and behavior. They shield developers from having to worry about many of the intricacies associated with building a robust speech dialog, e.g. confidence score interpretation, error recovery mechanisms, prompting, etc. This behavior can be customized by a developer if necessary to provide application-specific prompts, vocabulary, retry settings, etc.
In this version of VoiceXML, the only authentic reusable component calling mechanisms are <subdialog> and <object>. Components called this way follow a model similar to subroutines in programming languages: the component is configured by a well-defined set of parameters passed to the component, the component has a relatively constrained interaction with the calling application, the component returns a well-defined result, and control returns automatically to the point from which the component was called. This has all the significant advantages of modularity, reentrancy, and easy reuse provided by subroutines. Of the two kinds of components, only <subdialog> components are guaranteed to be as portable as VoiceXML itself. On the other hand, <object> components may be able to package advanced, reusable functionality that has not yet been introduced into the standard.
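The <subdialog> calling model can be sketched as follows; the document name, form ids, and field names below are illustrative, not defined by this specification:

```xml
<!-- Calling document: invokes the component and consumes its result -->
<form id="order">
  <subdialog name="result" src="get_card.vxml#get_card">
    <param name="kind" expr="'credit'"/>
    <filled>
      <!-- Control returns here automatically with the component's result -->
      <prompt>I have your <value expr="result.kind"/> card number.</prompt>
    </filled>
  </subdialog>
</form>

<!-- get_card.vxml: the component receives its parameters as form-level
     variables and returns a well-defined result via <return> -->
<form id="get_card">
  <var name="kind"/>
  <field name="number" type="digits">
    <prompt>Please say your card number.</prompt>
  </field>
  <block>
    <return namelist="kind number"/>
  </block>
</form>
```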
Although reusable dialog components have the advantages of modularity, reentrancy, and easy reuse as described above, the disadvantage of such components is that they must be designed very carefully with an eye to reuse, and even with the most careful of designs it is possible that the application developer will encounter situations for which the component cannot be easily configured to handle the application requirements. In addition, while the constrained interaction of a component with its calling environment makes it possible for the component designer to create a component that works predictably in disparate environments, it also may make the user's interaction with the component seem disconnected from the rest of the application.
In such situations the application developer may wish to reuse VoiceXML source code in the form of samples and templates: code designed for easy customizability. Such code is more easily tailored for and integrated into a particular application, at the expense of modularity and reentrancy.
Such templates and samples can be created by separating interesting VoiceXML code from a main dialog and then distributing that code by copy for use in other dialogs. This form of reusability allows the user of the copied VoiceXML code to modify it as necessary and continue to use their modified version indefinitely.
VoiceXML facilitates this form of reusability by preserving the separation of state between form elements. In this regard, VoiceXML and [HTML] are similar. An HTML table can be copied from one HTML page to another because the table can be displayed regardless of the context before or after the table element.
Although parameterizability, modularity, and maintainability may be sacrificed with this approach, it has the advantage of being simple, quick, and eminently customizable.
This W3C specification is based upon VoiceXML 1.0 submitted by the VoiceXML Forum in May 2000. The VoiceXML Forum authors were: Linda Boyer, IBM; Peter Danielsen, Lucent Technologies; Jim Ferrans, Motorola; Gerald Karam, AT&T; David Ladd, Motorola; Bruce Lucas, IBM; Kenneth Rehor, Lucent Technologies.
This version was written by the participants in the W3C Voice Browser Working Group. The following have significantly contributed to writing this specification:
The Working Group would like to thank Dave Raggett and Jim Larson for their invaluable management support.
The W3C Voice Browser Working Group has applied to IETF to register a media type for VoiceXML. The requested media type is application/voicexml+xml.
The W3C Voice Browser Working Group has adopted the convention of using the ".vxml" filename suffix for VoiceXML documents.
This section is Normative.
The XML Schema definition for VoiceXML is located at http://www.w3.org/TR/voicexml20/vxml.xsd.
The VoiceXML schema depends upon other schemas defined in the VoiceXML namespace:
The complete set of Speech Interface Framework schema required for VoiceXML 2.0 is available here.
The <field> type attribute in Section 2.3.1 is used to specify a builtin grammar for one of the fundamental types. Platform support for fundamental builtin grammars is optional. If a platform does support builtin types, then it must follow the description given in this appendix as closely as possible, including all the builtins for a given language.
Each builtin type has a convention for the format of the value returned. These are independent of language and of the implementation. The return type for builtin fields is a string except for the boolean field type. To access the actual recognition result, the author can reference the <field> shadow variable name$.utterance. Alternatively, the developer can access application.lastresult$, where application.lastresult$.interpretation has the same string value as application.lastresult$.utterance.
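For example, a <filled> action can log both the interpretation and the raw utterance; the field name and log messages here are illustrative:

```xml
<field name="confirm" type="boolean">
  <prompt>Shall I place the order?</prompt>
  <filled>
    <!-- confirm holds the submitted value; the shadow variable
         holds the exact words recognized -->
    <log>interpretation: <value expr="confirm"/></log>
    <log>utterance: <value expr="confirm$.utterance"/></log>
    <log>via lastresult: <value expr="application.lastresult$.utterance"/></log>
  </filled>
</field>
```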
The builtin types are defined in such a way that a VoiceXML application developer can assume some consistency of user input across implementations. This permits help messages and other prompts to be independent of platform in many instances. For example, the boolean type's grammar should minimally allow "yes" and "no" responses in English, but each implementation is free to add other choices, such as "yeah" and "nope".
In cases where an application requires specific behavior or different behavior than defined for a builtin, it should use an explicit field grammar. The following are circumstances in which an application must provide an explicit field grammar in order to ensure portability of the application with a consistent user interface:
A platform is not required to implement a grammar that accepts all possible values that might be returned by a builtin. For instance, the currency builtin defines the return value formatting for a very broad range of currencies ([ISO4217]). The platform is not required to support spoken input that includes any of the world's currencies, since that can negatively impact recognition accuracy. Similarly, the number builtin can return positive or negative floating point numbers, but the grammar is not required to support all possible spoken floating point numbers.
Builtins are also limited in their ability to handle underspecified spoken input. For instance, "20 peso" cannot be resolved to a specific [ISO4217] currency code because "peso" is the name of the currency of numerous nations. In such cases the platform may return a specific currency code according to the language or may omit the currency code.
All builtin types must support both voice and DTMF entry.
The set of accepted spoken input for each builtin type is platform dependent and will vary by language.
The value returned by a builtin type can be read out using the <say-as> element. VoiceXML extends <say-as> in [SSML] by adding 'interpret-as' values corresponding to each builtin type. These values take the form "vxml:<type>" where type is a builtin type. The precise rendering of builtin types is platform-specific and will vary by language.
The builtin types are:
boolean: Inputs include affirmative and negative phrases appropriate to the current language. DTMF 1 is affirmative and 2 is negative. The result is ECMAScript true for affirmative or false for negative. The value will be submitted as the string "true" or the string "false". If the field value is subsequently used in <say-as> with the interpret-as value "vxml:boolean", it will be spoken as an affirmative or negative phrase appropriate to the current language.

date: Valid spoken inputs include phrases that specify a date, including a month, day, and year. DTMF inputs are: four digits for the year, followed by two digits for the month, and two digits for the day. The result is a fixed-length date string with format yyyymmdd, e.g. "20000704". If the year is not specified, yyyy is returned as "????"; if the month is not specified, mm is returned as "??"; and if the day is not specified, dd is returned as "??". If the value is subsequently used in <say-as> with the interpret-as value "vxml:date", it will be spoken as a date phrase appropriate to the current language.

digits: Valid spoken or DTMF inputs include one or more digits, 0 through 9. The result is a string of digits. If the result is subsequently used in <say-as> with the interpret-as value "vxml:digits", it will be spoken as a sequence of digits appropriate to the current language. A user can say for example "two one two seven", but not "twenty one hundred and twenty-seven". A platform may support constructs such as "two double-five eight".

currency: Valid spoken inputs include phrases that specify a currency amount. For DTMF input, the "*" key will act as the decimal point. The result is a string with the format UUUmm.nn, where UUU is the three-character currency indicator according to ISO standard 4217 [ISO4217], or mm.nn if the currency is not spoken by the user or if the currency cannot be reliably determined (e.g. "dollar" and "peso" are ambiguous). If the field is subsequently used in <say-as> with the interpret-as value "vxml:currency", it will be spoken as a currency amount appropriate to the current language.

number: Valid spoken inputs include phrases that specify numbers, such as "one hundred twenty-three" or "five point three". Valid DTMF input includes positive numbers entered using digits and "*" to represent a decimal point. The result is a string of digits from 0 to 9 and may optionally include a decimal point (".") and/or a plus or minus sign. ECMAScript automatically converts result strings to numerical values when used in numerical expressions. The result must not use a leading zero (which would cause ECMAScript to interpret it as an octal number). If the field is subsequently used in <say-as> with the interpret-as value "vxml:number", it will be spoken as a number appropriate to the current language.

phone: Valid spoken inputs include phrases that specify a phone number. DTMF asterisk "*" represents "x". The result is a string containing a telephone number, consisting of a string of digits and optionally containing the character "x" to indicate a phone number with an extension. For North America, a result could be "8005551234x789". If the field is subsequently used in <say-as> with the interpret-as value "vxml:phone", it will be spoken as a phone number appropriate to the current language.

time: Valid spoken inputs include phrases that specify a time, including hours and minutes. The result is a five-character string in the format hhmmx, where x is one of "a" for AM, "p" for PM, "h" to indicate a time specified using the 24 hour clock, or "?" to indicate an ambiguous time. Input can be via DTMF. Because there is no DTMF convention for specifying AM/PM, in the case of DTMF input the result will always end with "h" or "?". If the field is subsequently used in <say-as> with the interpret-as value "vxml:time", it will be spoken as a time appropriate to the current language.
An example of a <field> element with a builtin grammar type:
<field name="lo_fat_meal" type="boolean">
  <prompt>
    Do you want a low fat meal on this flight?
  </prompt>
  <help>
    Low fat means less than 10 grams of fat, and under 250 calories.
  </help>
  <filled>
    <prompt>
      I heard <emphasis><say-as interpret-as="vxml:boolean">
      <value expr="lo_fat_meal"/></say-as></emphasis>.
    </prompt>
  </filled>
</field>
In this example, the boolean type indicates that inputs are various forms of true and false. The value actually put into the field is either true or false. The field would be read out using the appropriate affirmative or negative response in prompts.
In the next example, digits indicates that input will be spoken or keyed digits. The result is stored as a string, and rendered as digits using <say-as> with "vxml:digits" as the value for the interpret-as attribute, i.e. "one-two-three", not "one hundred twenty-three". The <filled> action tests the field to see if it has 12 digits. If not, the user hears the error message.
<field name="ticket_num" type="digits">
  <prompt>
    Read the 12 digit number from your ticket.
  </prompt>
  <help>The 12 digit number is to the lower left.</help>
  <filled>
    <if cond="ticket_num.length != 12">
      <prompt>
        Sorry, I didn't hear exactly 12 digits.
      </prompt>
      <assign name="ticket_num" expr="undefined"/>
    <else/>
      <prompt>I heard <say-as interpret-as="vxml:digits">
        <value expr="ticket_num"/></say-as>
      </prompt>
    </if>
  </filled>
</field>
The builtin boolean grammar and builtin digits grammar can be parameterized. This is done by explicitly referring to builtin grammars using a platform-specific builtin URI scheme and a URI-style query syntax of the form type?param=value in the src attribute of a <grammar> element, or in the type attribute of a field, for example:
<grammar src="builtin:dtmf/boolean?y=7;n=9"/>

<field type="boolean?y=7;n=9">
  <prompt>
    If this is correct say yes or press seven,
    if not, say no or press nine.
  </prompt>
</field>

<field type="digits?minlength=3;maxlength=5">
  <prompt>Please enter your passcode</prompt>
</field>
Here the <grammar> parameterizes the builtin DTMF grammar, the first <field> parameterizes the builtin DTMF grammar (the speech grammar will be activated as normal), and the second <field> parameterizes both the builtin DTMF and speech grammars. Parameters which are undefined for a given grammar type will be ignored; for example, "builtin:grammar/boolean?y=7".
The digits and boolean grammars can be parameterized as follows:
digits?minlength=n: A string of at least n digits. Applicable to speech and DTMF grammars. If minlength conflicts with either the length or maxlength attributes then an error.badfetch event is thrown.

digits?maxlength=n: A string of at most n digits. Applicable to speech and DTMF grammars. If maxlength conflicts with either the length or minlength attributes then an error.badfetch event is thrown.

digits?length=n: A string of exactly n digits. Applicable to speech and DTMF grammars. If length conflicts with either the minlength or maxlength attributes then an error.badfetch event is thrown.

boolean?y=d: A grammar that treats the keypress d as an affirmative answer. Applicable only to the DTMF grammar.

boolean?n=d: A grammar that treats the keypress d as a negative answer. Applicable only to the DTMF grammar.
Note that more than one parameter may be specified, separated by ";" as illustrated above. When a <grammar> element with the mode set to "voice" (the default value) is specified in a <field>, it is in addition to the default speech grammar implied by the type attribute of the field. Likewise, when a <grammar> element with the mode set to "dtmf" is specified in a <field>, it is in addition to the default DTMF grammar.
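For instance, a field might supplement the builtin grammars implied by its type attribute with an additional DTMF grammar; the builtin speech and DTMF digits grammars remain active alongside it. The fragment below is a sketch, not normative:

```xml
<field name="extension" type="digits?length=4">
  <prompt>
    Please say or key in your four digit extension,
    or press star for the operator.
  </prompt>
  <!-- Added DTMF grammar; active in addition to the builtin
       DTMF digits grammar implied by the type attribute -->
  <grammar mode="dtmf" version="1.0" root="operator">
    <rule id="operator">
      <item>*</item>
    </rule>
  </grammar>
</field>
```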