Please refer to the errata for this document, which may include some normative corrections.
See also translations.
Copyright © 2009 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
The W3C Multimodal Interaction Working Group aims to develop specifications to enable access to the Web using multimodal interaction. This document is part of a set of specifications for multimodal systems, and provides details of an XML markup language for containing and annotating the interpretation of user input. Examples of interpretation of user input are a transcription into words of a raw signal, for instance derived from speech, pen or keystroke input, a set of attribute/value pairs describing their meaning, or a set of attribute/value pairs describing a gesture. The interpretation of the user's input is expected to be generated by signal interpretation processes, such as speech and ink recognition, semantic interpreters, and other types of processors, for use by components that act on the user's inputs such as interaction managers.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is the Recommendation of "EMMA: Extensible MultiModal Annotation markup language". It has been produced by the Multimodal Interaction Working Group, which is part of the Multimodal Interaction Activity.
Comments are welcome on www-multimodal@w3.org (archive). See W3C mailing list and archive usage guidelines.
The design of EMMA has been widely reviewed (see the disposition of comments) and satisfies the Working Group's technical requirements. A list of implementations is included in the EMMA Implementation Report. The Working Group made a few editorial changes to the 15 December 2008 Proposed Recommendation. Changes from the Proposed Recommendation can be found in Appendix F.
This document has been reviewed by W3C Members, by software developers, and by other W3C groups and interested parties, and is endorsed by the Director as a W3C Recommendation. It is a stable document and may be used as reference material or cited from another document. W3C's role in making the Recommendation is to draw attention to the specification and to promote its widespread deployment. This enhances the functionality and interoperability of the Web.
This specification describes markup for representing interpretations of user input (speech, keystrokes, pen input etc.) together with annotations for confidence scores, timestamps, input medium etc., and forms part of the proposals for the W3C Multimodal Interaction Framework.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
The sections in the main body of this document are normative unless otherwise specified. The appendices in this document are informative unless otherwise indicated explicitly.
All sections in this specification are normative, unless otherwise indicated. The informative parts of this specification are identified by "Informative" labels within sections.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].
- emma:model element
- emma:derived-from element and emma:derivation element
- emma:grammar element
- emma:info element
- emma:endpoint-info element and emma:endpoint element
- emma:tokens attribute
- emma:process attribute
- emma:no-input attribute
- emma:uninterpreted attribute
- emma:lang attribute
- emma:signal and emma:signal-size attributes
- emma:media-type attribute
- emma:confidence attribute
- emma:source attribute
- emma:medium, emma:mode, emma:function, emma:verbal attributes
- emma:hook attribute
- emma:cost attribute
- emma:endpoint-role, emma:endpoint-address, emma:port-type, emma:port-num, emma:message-id, emma:service-name, emma:endpoint-pair-ref, emma:endpoint-info-ref attributes
- emma:grammar element: emma:grammar-ref attribute
- emma:model element: emma:model-ref attribute
- emma:dialog-turn attribute
- emma:hook and SRGS (Informative)

This section is Informative.
This document presents an XML specification for EMMA, an Extensible MultiModal Annotation markup language, responding to the requirements documented in Requirements for EMMA [EMMA Requirements]. This markup language is intended for use by systems that provide semantic interpretations for a variety of inputs, including but not necessarily limited to, speech, natural language text, GUI and ink input.
It is expected that this markup will be used primarily as a standard data interchange format between the components of a multimodal system; in particular, it will normally be automatically generated by interpretation components to represent the semantics of users' inputs, not directly authored by developers.
The language is focused on annotating single inputs from users, which may be either from a single mode or a composite input combining information from multiple modes, as opposed to information that might have been collected over multiple turns of a dialog. The language provides a set of elements and attributes that are focused on enabling annotations on user inputs and interpretations of those inputs.
An EMMA document can be considered to hold three types of data:
instance data
Application-specific markup corresponding to input information which is meaningful to the consumer of an EMMA document. Instances are application-specific and built by input processors at runtime. Given that utterances may be ambiguous with respect to input values, an EMMA document may hold more than one instance.
data model
Constraints on the structure and content of an instance. The data model is typically pre-established by an application, and may be implicit, that is, unspecified.
metadata
Annotations associated with the data contained in the instance. Annotation values are added by input processors at runtime.
Given the assumptions above about the nature of data represented in an EMMA document, the following general principles apply to the design of EMMA:
emma:info element (Section 4.1.4). The annotations of EMMA should be considered 'normative' in the sense that if an EMMA component produces annotations as described in Section 3 and Section 4, these annotations must be represented using the EMMA syntax. The Multimodal Interaction Working Group may address in later drafts the issues of modularization and profiling; that is, which sets of annotations are to be supported by which classes of EMMA component.
The general purpose of EMMA is to represent information automatically extracted from a user's input by an interpretation component, where input is to be taken in the general sense of a meaningful user input in any modality supported by the platform. The reader should refer to the sample architecture in W3C Multimodal Interaction Framework [MMI Framework], which shows EMMA conveying content between user input modality components and an interaction manager.
Components that generate EMMA markup:
Components that use EMMA include:
Although not a primary goal of EMMA, a platform may also choose to use this general format as the basis of a general semantic result that is carried along and filled out during each stage of processing. In addition, future systems may also potentially make use of this markup to convey abstract semantic content to be rendered into natural language by a natural language generation component.
emma:time-ref-uri, emma:time-ref-anchor-point: allows you to specify whether the referenced anchor is the start or end of the interval. URI values use the anyURI primitive as defined in XML Schema Part 2: Datatypes Second Edition, Section 3.2.17 [SCHEMA2].

This section is Informative.
As noted above, the main components of an interpreted user input in EMMA are the instance data, an optional data model, and the metadata annotations that may be applied to that input. The realization of these components in EMMA is as follows:
An EMMA interpretation is the primary unit for holding user input as interpreted by an EMMA processor. As will be seen below, multiple interpretations of a single input are possible.
EMMA provides a simple structural syntax for the organization of interpretations and instances, and an annotative syntax to apply the annotation to the input data at different levels.
An outline of the structural syntax and annotations found in EMMA documents is as follows. A fuller definition may be found in the description of individual elements and attributes in Section 3 and Section 4.
The root element, emma:emma, holds EMMA version and namespace information, and provides a container for one or more of the following interpretation and container elements (Section 3.1):

- The emma:interpretation element contains a given interpretation of the input and holds application specific markup (Section 3.2).
- emma:one-of is a container for one or more interpretation elements or container elements and denotes that these are mutually exclusive interpretations (Section 3.3.1).
- emma:group is a general container for one or more interpretation elements or container elements. It can be associated with arbitrary grouping criteria (Section 3.3.2).
- emma:sequence is a container for one or more interpretation elements or container elements and denotes that these are sequential in time (Section 3.3.3).
- The emma:lattice element is used to contain a series of emma:arc and emma:node elements that define a lattice of words, gestures, meanings or other symbols. The emma:lattice element appears within the emma:interpretation element (Section 3.4).
- The emma:literal element is used as a wrapper when the application semantics is a string literal (Section 3.5).

Annotations such as emma:derived-from, emma:endpoint-info, and emma:info are represented as elements so that they can occur more than once within an element and can contain internal structure (Section 4.1). Annotations such as emma:start, emma:end, emma:confidence, and emma:tokens are represented as attributes. They can appear on emma:interpretation elements. Some can appear on container elements, lattice elements, and elements in the application-specific markup (Section 4.2).

From the defined root node emma:emma, the structure of an EMMA document consists of a tree of EMMA container elements (emma:one-of, emma:sequence, emma:group) terminating in a number of interpretation elements (emma:interpretation). The emma:interpretation elements serve as wrappers for either application namespace markup describing the interpretation of the user's input, or an emma:lattice element or emma:literal element. A single emma:interpretation may also appear directly under the root node.
The EMMA elements emma:emma, emma:interpretation, emma:one-of, and emma:literal and the EMMA attributes emma:no-input, emma:uninterpreted, emma:medium, and emma:mode are required of all implementations. The remaining elements and attributes are optional and may be used in some implementations and not others, depending on the specific modalities and processing being represented.
To illustrate this, here is an example of an EMMA document representing input to a flight reservation application. In this example there are two speech recognition results and associated semantic representations of the input. The system is uncertain whether the user meant "flights from Boston to Denver" or "flights from Austin to Denver". The annotations to be captured are timestamps and confidence scores for the two inputs.
Example:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:one-of emma:start="1087995961542" emma:end="1087995963542"
      emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation emma:confidence="0.75"
        emma:tokens="flights from boston to denver">
      <origin>Boston</origin>
      <destination>Denver</destination>
    </emma:interpretation>
    <emma:interpretation emma:confidence="0.68"
        emma:tokens="flights from austin to denver">
      <origin>Austin</origin>
      <destination>Denver</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
Attributes on the root emma:emma element indicate the version and namespace. The emma:emma element contains an emma:one-of element which contains a disjunctive list of possible interpretations of the input. The actual semantic representation of each interpretation is within the application namespace. In the example here the application-specific semantics involves elements origin and destination indicating the origin and destination cities for looking up a flight. The timestamp is the same for both interpretations and it is annotated using values in milliseconds in the emma:start and emma:end attributes on the emma:one-of. The confidence scores and tokens associated with each of the inputs are annotated using the EMMA annotation attributes emma:confidence and emma:tokens on each of the emma:interpretation elements.
An EMMA data model expresses the constraints on the structure and content of instance data, for the purposes of validation. As such, the data model may be considered as a particular kind of annotation (although, unlike other EMMA annotations, it is not a feature pertaining to a specific user input at a specific moment in time; it is rather a static and, by its very definition, application-specific structure). The specification of a data model in EMMA is optional.
Since Web applications today use different formats to specify data models, e.g. XML Schema Part 1: Structures Second Edition [XML Schema Structures], XForms 1.0 (Second Edition) [XFORMS], RELAX NG Specification [RELAX-NG], etc., EMMA itself is agnostic to the format of the data model used.
Data model definition and reference is defined in Section 4.1.1.
An EMMA attribute is qualified with the EMMA namespace prefix if the attribute can also be used as an in-line annotation on elements in the application's namespace. Most of the EMMA annotation attributes in Section 4.2 are in this category. An EMMA attribute is not qualified with the EMMA namespace prefix if the attribute only appears on an EMMA element. This rule ensures consistent usage of the attributes across all examples.
Attributes from other namespaces are permissible on all EMMA elements. As an example, xml:lang may be used to annotate the human language of character data content.
This section defines elements in the EMMA namespace which provide the structural syntax of EMMA documents.
emma:emma
Annotation | emma:emma |
---|---|
Definition | The root element of an EMMA document. |
Children | The emma:emma element MUST immediately contain a single emma:interpretation element or EMMA container element: emma:one-of, emma:group, emma:sequence. It MAY also contain an optional single emma:derivation element and an optional single emma:info annotation element. It MAY also contain multiple optional emma:grammar annotation elements, emma:model annotation elements, and emma:endpoint-info annotation elements. |
Attributes |
|
Applies to | None |
The root element of an EMMA document is named emma:emma. It holds a single emma:interpretation or EMMA container element (emma:one-of, emma:sequence, emma:group). It MAY also contain a single emma:derivation element containing earlier stages of the processing of the input (see Section 4.1.2). It MAY also contain an optional single annotation element emma:info and multiple optional emma:grammar, emma:model, and emma:endpoint-info elements.
It MAY hold attributes for information pertaining to EMMA itself, along with any namespaces which are declared for the entire document, and any other EMMA annotative data. The emma:emma element and other elements and attributes defined in this specification belong to the XML namespace identified by the URI "http://www.w3.org/2003/04/emma". In the examples, the EMMA namespace is generally declared using the attribute xmlns:emma on the root emma:emma element. EMMA processors MUST support the full range of ways of declaring XML namespaces as defined by Namespaces in XML 1.1 (Second Edition) [XMLNS]. Application markup MAY be declared in an explicit application namespace, or an undefined namespace (equivalent to setting xmlns="").
For example:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma">
  ....
</emma:emma>
or
<emma version="1.0"
    xmlns="http://www.w3.org/2003/04/emma">
  ....
</emma>
emma:interpretation
Annotation | emma:interpretation |
---|---|
Definition | The emma:interpretation element acts as a wrapper for application instance data or lattices. |
Children | The emma:interpretation element MUST immediately contain either application instance data, or a single emma:lattice element, or a single emma:literal element; in the case of uninterpreted input or no input, emma:interpretation MUST be empty. It MAY also contain multiple optional emma:derived-from elements and an optional single emma:info element. |
Attributes |
|
Applies to | The emma:interpretation element is legal only as a child of emma:emma, emma:group, emma:one-of, emma:sequence, or emma:derivation. |
The emma:interpretation element holds a single interpretation represented in application specific markup, or a single emma:lattice element, or a single emma:literal element.
The emma:interpretation element MUST be empty if it is marked with emma:no-input="true" (Section 4.2.3). The emma:interpretation element MUST be empty if it has been annotated with emma:uninterpreted="true" (Section 4.2.4) or emma:function="recording" (Section 4.2.11).
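For example, a silence timeout might be reported as an empty interpretation (a sketch; the timestamp values are invented for illustration):

```xml
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="int1"
      emma:no-input="true"
      emma:medium="acoustic" emma:mode="voice"
      emma:start="1087995961542" emma:end="1087995963542"/>
</emma:emma>
```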
Attributes:
An xsd:ID value that uniquely identifies the interpretation within the EMMA document.

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation emma:medium="acoustic" emma:mode="voice">
    ...
  </emma:interpretation>
</emma:emma>
While emma:medium and emma:mode are optional on emma:interpretation, note that all EMMA interpretations must be annotated for emma:medium and emma:mode, so either these attributes must appear directly on emma:interpretation, or they must appear on an ancestor emma:one-of node, or they must appear on an earlier stage of the derivation listed in emma:derivation.
emma:one-of element

Annotation | emma:one-of |
---|---|
Definition | A container element indicating a disjunction among a collection of mutually exclusive interpretations of the input. |
Children | The emma:one-of element MUST immediately contain a collection of one or more emma:interpretation elements or container elements: emma:one-of, emma:group, emma:sequence. It MAY also contain multiple optional emma:derived-from elements and an optional single emma:info element. |
Attributes |
|
Applies to | The emma:one-of element MAY only appear as a child of emma:emma, emma:one-of, emma:group, emma:sequence, or emma:derivation. |
The emma:one-of element acts as a container for a collection of one or more interpretation (emma:interpretation) or container elements (emma:one-of, emma:group, emma:sequence), and denotes that these are mutually exclusive interpretations.
An N-best list of choices in EMMA MUST be represented as a set of emma:interpretation elements contained within an emma:one-of element. For instance, a series of different recognition results in speech recognition might be represented in this way.
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:one-of emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation>
      <origin>Boston</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>
    <emma:interpretation>
      <origin>Austin</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
The function of the emma:one-of element is to represent a disjunctive list of possible interpretations of a user input. A disjunction of possible interpretations of an input can be the result of different kinds of processing or ambiguity. One source is multiple results from a recognition technology such as speech or handwriting recognition. Multiple results can also occur from parsing or understanding natural language. Another possible source of ambiguity is the application of multiple different kinds of recognition or understanding components to the same input signal. For example, a single ink input signal might be processed by both handwriting recognition and gesture recognition. Another is the use of more than one recording device for the same input (multiple microphones).
In order to make explicit these different kinds of multiple interpretations and allow for concise statement of the annotations associated with each, the emma:one-of element MAY appear within another emma:one-of element. If emma:one-of elements are nested then they MUST indicate the kind of disjunction using the attribute disjunction-type. The values of disjunction-type are {recognition, understanding, multi-device, multi-process}. For the most common use case, where there are multiple recognition results and some of them have multiple interpretations, the top-level emma:one-of is disjunction-type="recognition" and the embedded emma:one-of has the attribute disjunction-type="understanding".
As an example, if in an interactive flight reservation application recognition yielded 'Boston' or 'Austin', and each had a semantic interpretation as either the assertion of a city name or the specification of a flight query with the city as the destination, this would be represented as follows in EMMA:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:one-of disjunction-type="recognition"
      emma:start="12457990" emma:end="12457995"
      emma:medium="acoustic" emma:mode="voice">
    <emma:one-of disjunction-type="understanding" emma:tokens="boston">
      <emma:interpretation>
        <assert><city>boston</city></assert>
      </emma:interpretation>
      <emma:interpretation>
        <flight><dest><city>boston</city></dest></flight>
      </emma:interpretation>
    </emma:one-of>
    <emma:one-of disjunction-type="understanding" emma:tokens="austin">
      <emma:interpretation>
        <assert><city>austin</city></assert>
      </emma:interpretation>
      <emma:interpretation>
        <flight><dest><city>austin</city></dest></flight>
      </emma:interpretation>
    </emma:one-of>
  </emma:one-of>
</emma:emma>
EMMA MAY explicitly represent ambiguity resulting from different processes, devices, or sources using embedded emma:one-of and the disjunction-type attribute. Multiple different interpretations resulting from different factors MAY also be listed within a single unstructured emma:one-of, though in this case it is more complex or impossible to uncover the sources of the ambiguity if required by later stages of processing. If there is no embedding in emma:one-of, then the disjunction-type attribute is not required. If the disjunction-type attribute is missing then by default the source of disjunction is unspecified.
The example case above could also be represented as:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:one-of emma:start="12457990" emma:end="12457995"
      emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation emma:tokens="boston">
      <assert><city>boston</city></assert>
    </emma:interpretation>
    <emma:interpretation emma:tokens="boston">
      <flight><dest><city>boston</city></dest></flight>
    </emma:interpretation>
    <emma:interpretation emma:tokens="austin">
      <assert><city>austin</city></assert>
    </emma:interpretation>
    <emma:interpretation emma:tokens="austin">
      <flight><dest><city>austin</city></dest></flight>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
But in this case information about which interpretations resulted from speech recognition and which resulted from language understanding is lost.
A list of emma:interpretation elements within an emma:one-of MUST be sorted best-first by some measure of quality. The quality measure is emma:confidence if present; otherwise, the quality metric is platform-specific.
With embedded emma:one-of structures there is no requirement for the confidence scores within different emma:one-of to be on the same scale. For example, the scores assigned by handwriting recognition might not be comparable to those assigned by gesture recognition. Similarly, if multiple recognizers are used there is no guarantee that their confidence scores will be comparable. For this reason the ordering requirement on emma:interpretation within emma:one-of only applies locally to sister emma:interpretation elements within each emma:one-of. There is no requirement on the ordering of embedded emma:one-of elements within a higher emma:one-of element.
While emma:medium and emma:mode are optional on emma:one-of, note that all EMMA interpretations must be annotated for emma:medium and emma:mode, so either these annotations must appear directly on all of the contained emma:interpretation elements within the emma:one-of, or they must appear on the emma:one-of element itself, or they must appear on an ancestor emma:one-of element, or they must appear on an earlier stage of the derivation listed in emma:derivation.
emma:group element

Annotation | emma:group |
---|---|
Definition | A container element indicating that a number of interpretations of distinct user inputs are grouped according to some criteria. |
Children | The emma:group element MUST immediately contain a collection of one or more emma:interpretation elements or container elements: emma:one-of, emma:group, emma:sequence. It MAY also contain an optional single emma:group-info element. It MAY also contain multiple optional emma:derived-from elements and an optional single emma:info element. |
Attributes |
|
Applies to | The emma:group element is legal only as a child of emma:emma, emma:one-of, emma:group, emma:sequence, or emma:derivation. |
The emma:group element is used to indicate that the contained interpretations are from distinct user inputs that are related in some manner. emma:group MUST NOT be used for containing the multiple stages of processing of a single user input. Those MUST be contained in the emma:derivation element instead (Section 4.1.2). For groups of inputs in temporal order the more specialized container emma:sequence MUST be used (Section 3.3.3). The following example shows three interpretations derived from the speech input "Move this ambulance here" and the tactile input related to two consecutive points on a map.
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:group emma:start="1087995961542" emma:end="1087995964542">
    <emma:interpretation emma:medium="acoustic" emma:mode="voice">
      <action>move</action>
      <object>ambulance</object>
      <destination>here</destination>
    </emma:interpretation>
    <emma:interpretation emma:medium="tactile" emma:mode="ink">
      <x>0.253</x>
      <y>0.124</y>
    </emma:interpretation>
    <emma:interpretation emma:medium="tactile" emma:mode="ink">
      <x>0.866</x>
      <y>0.724</y>
    </emma:interpretation>
  </emma:group>
</emma:emma>
The emma:one-of and emma:group containers MAY be nested arbitrarily.
emma:group-info element

Annotation | emma:group-info |
---|---|
Definition | The emma:group-info element contains or references criteria used in establishing the grouping of interpretations in an emma:group element. |
Children | The emma:group-info element MUST either immediately contain inline instance data specifying grouping criteria or have the attribute ref referencing the criteria. |
Attributes |
|
Applies to | The emma:group-info element is legal only as a child of emma:group. |
Sometimes it may be convenient to indirectly associate a given group with information, such as grouping criteria. The emma:group-info element might be used to make explicit the criteria by which members of a group are associated. In the following example, a group of two points is associated with a description of grouping criteria based upon a sliding temporal window of two seconds duration.
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example"
    xmlns:ex="http://www.example.com/ns/group">
  <emma:group>
    <emma:group-info>
      <ex:mode>temporal</ex:mode>
      <ex:duration>2s</ex:duration>
    </emma:group-info>
    <emma:interpretation emma:medium="tactile" emma:mode="ink">
      <x>0.253</x>
      <y>0.124</y>
    </emma:interpretation>
    <emma:interpretation emma:medium="tactile" emma:mode="ink">
      <x>0.866</x>
      <y>0.724</y>
    </emma:interpretation>
  </emma:group>
</emma:emma>
You might also use emma:group-info to refer to a named grouping criterion using an external reference, for instance:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example"
    xmlns:ex="http://www.example.com/ns/group">
  <emma:group>
    <emma:group-info ref="http://www.example.com/criterion42"/>
    <emma:interpretation emma:medium="tactile" emma:mode="ink">
      <x>0.253</x>
      <y>0.124</y>
    </emma:interpretation>
    <emma:interpretation emma:medium="tactile" emma:mode="ink">
      <x>0.866</x>
      <y>0.724</y>
    </emma:interpretation>
  </emma:group>
</emma:emma>
emma:sequence element

Annotation | emma:sequence |
---|---|
Definition | A container element indicating that a number of interpretations of distinct user inputs are in temporal sequence. |
Children | The emma:sequence element MUST immediately contain a collection of one or more emma:interpretation elements or container elements: emma:one-of, emma:group, emma:sequence. It MAY also contain multiple optional emma:derived-from elements and an optional single emma:info element. |
Attributes |
|
Applies to | The emma:sequence element is legal only as a child of emma:emma, emma:one-of, emma:group, emma:sequence, or emma:derivation. |
The emma:sequence element is used to indicate that the contained interpretations are sequential in time, as in the following example, which indicates that two points made with a pen are in temporal order.
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:sequence>
    <emma:interpretation emma:medium="tactile" emma:mode="ink">
      <x>0.253</x>
      <y>0.124</y>
    </emma:interpretation>
    <emma:interpretation emma:medium="tactile" emma:mode="ink">
      <x>0.866</x>
      <y>0.724</y>
    </emma:interpretation>
  </emma:sequence>
</emma:emma>
The emma:sequence container MAY be combined with emma:one-of and emma:group in arbitrary nesting structures. The order of children in the content of the emma:sequence element corresponds to a sequence of interpretations. This ordering does not imply any particular definition of sequentiality. EMMA processors are expected therefore to use the emma:sequence element to hold interpretations which are either strictly sequential in nature (e.g. the end-time of an interpretation precedes the start-time of its follower), or which overlap in some manner (e.g. the start-time of a follower interpretation precedes the end-time of its precedent). It is possible to use timestamps to provide fine-grained annotation for the sequence of interpretations that are sequential in time (see Section 4.2.10).
In the following more complex example, a sequence of two pen gestures in emma:sequence and a speech input in emma:interpretation are contained in an emma:group.
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:group>
    <emma:interpretation emma:medium="acoustic" emma:mode="voice">
      <action>move</action>
      <object>this-battleship</object>
      <destination>here</destination>
    </emma:interpretation>
    <emma:sequence>
      <emma:interpretation emma:medium="tactile" emma:mode="ink">
        <x>0.253</x>
        <y>0.124</y>
      </emma:interpretation>
      <emma:interpretation emma:medium="tactile" emma:mode="ink">
        <x>0.866</x>
        <y>0.724</y>
      </emma:interpretation>
    </emma:sequence>
  </emma:group>
</emma:emma>
In addition to providing the ability to represent N-best lists of interpretations using emma:one-of, EMMA also provides the capability to represent lattices of words or other symbols using the emma:lattice element. Lattices provide a compact representation of large lists of possible recognition results or interpretations for speech, pen, or multimodal inputs.
In addition to providing a representation for lattice output from speech recognition, another important use case for lattices is the representation of the results of gesture and handwriting recognition from a pen modality component. Lattices can also be used to compactly represent multiple possible meaning representations. Another use case for the lattice representation is associating confidence scores and other annotations with individual words within a speech recognition result string.
Lattices are compactly described by a list of transitions between nodes. For each transition the start and end nodes MUST be defined, along with the label for the transition. Initial and final nodes MUST also be indicated. The following figure provides a graphical representation of a speech recognition lattice which compactly represents eight different sequences of words.

[Figure: speech recognition lattice over nodes 1-8 (graphic not reproduced).]

The lattice expands to:
a. flights to boston from portland today please
b. flights to austin from portland today please
c. flights to boston from oakland today please
d. flights to austin from oakland today please
e. flights to boston from portland tomorrow
f. flights to austin from portland tomorrow
g. flights to boston from oakland tomorrow
h. flights to austin from oakland tomorrow
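The eight sequences follow mechanically from the arc structure. As a sketch (the arc data is transcribed from the lattice described above; the expand helper is invented, not part of EMMA), a depth-first walk enumerates every path from the initial to the final node:

```python
# Arc data transcribed from the lattice figure: (from_node, to_node, word).
# Node 1 is initial and node 8 is final.
ARCS = [
    (1, 2, "flights"), (2, 3, "to"),
    (3, 4, "boston"), (3, 4, "austin"),
    (4, 5, "from"),
    (5, 6, "portland"), (5, 6, "oakland"),
    (6, 7, "today"), (7, 8, "please"),
    (6, 8, "tomorrow"),
]

def expand(arcs, initial, final):
    """Enumerate every word sequence from the initial to the final node."""
    succ = {}
    for f, t, w in arcs:
        succ.setdefault(f, []).append((t, w))
    paths = []

    def walk(node, words):
        if node == final:
            paths.append(" ".join(words))
            return
        for t, w in succ.get(node, []):
            walk(t, words + [w])

    walk(initial, [])
    return paths

paths = expand(ARCS, 1, 8)
print(len(paths))  # 8 -- ten arcs encode eight word sequences
```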
emma:lattice, emma:arc, emma:node elements

Annotation | emma:lattice |
---|---|
Definition | An element which encodes a lattice representation of user input. |
Children | The emma:lattice element MUST immediately contain one or more emma:arc elements and zero or more emma:node elements. |
Attributes |
|
Applies to | The emma:lattice element is legal only as a child of the emma:interpretation element. |
Annotation | emma:arc |
Definition | An element which encodes a transition between two nodes in a lattice. The label associated with the arc in the lattice is represented in the content of emma:arc. |
Children | The emma:arc element MUST immediately contain either character data or a single application namespace element, or be empty in the case of epsilon transitions. It MAY contain an emma:info element containing application or vendor specific annotations. |
Attributes |
|
Applies to | The emma:arc element is legal only as a child of the emma:lattice element. |
Annotation | emma:node |
Definition | An element which represents a node in the lattice. The emma:node elements are not required to describe a lattice but might be added to provide a location for annotations on nodes in a lattice. There MUST be at most one emma:node specification for each numbered node in the lattice. |
Children | An OPTIONAL emma:info element for application or vendor specific annotations on the node. |
Attributes |
|
Applies to | The emma:node element is legal only as a child of the emma:lattice element. |
In EMMA, a lattice is represented using an element emma:lattice, which has attributes initial and final for indicating the initial and final nodes of the lattice. For the lattice below, this will be: <emma:lattice initial="1" final="8"/>. The nodes are numbered with integers. If there is more than one distinct final node in the lattice, the nodes MUST be represented as a space separated list in the value of the final attribute, e.g. <emma:lattice initial="1" final="9 10 23"/>. There MUST only be one initial node in an EMMA lattice. Each transition in the lattice is represented as an element emma:arc with attributes from and to which indicate the nodes where the transition starts and ends. The arc's label is represented as the content of the emma:arc element and MUST be well-formed character or XML content. In the example here the contents are words. Empty (epsilon) transitions in a lattice MUST be represented in the emma:lattice representation as empty emma:arc elements, e.g. <emma:arc from="1" to="8"/>.
The example speech lattice above would be represented in EMMA markup as follows:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation emma:medium="acoustic" emma:mode="voice">
    <emma:lattice initial="1" final="8">
      <emma:arc from="1" to="2">flights</emma:arc>
      <emma:arc from="2" to="3">to</emma:arc>
      <emma:arc from="3" to="4">boston</emma:arc>
      <emma:arc from="3" to="4">austin</emma:arc>
      <emma:arc from="4" to="5">from</emma:arc>
      <emma:arc from="5" to="6">portland</emma:arc>
      <emma:arc from="5" to="6">oakland</emma:arc>
      <emma:arc from="6" to="7">today</emma:arc>
      <emma:arc from="7" to="8">please</emma:arc>
      <emma:arc from="6" to="8">tomorrow</emma:arc>
    </emma:lattice>
  </emma:interpretation>
</emma:emma>
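A consuming component can recover the arcs with any namespace-aware XML parser. A minimal sketch using Python's standard library (the document below is an abbreviated version of the example above, with later arcs omitted for brevity):

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Abbreviated EMMA document (later arcs omitted for brevity).
DOC = """\
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation emma:medium="acoustic" emma:mode="voice">
    <emma:lattice initial="1" final="8">
      <emma:arc from="1" to="2">flights</emma:arc>
      <emma:arc from="2" to="3">to</emma:arc>
      <emma:arc from="3" to="4">boston</emma:arc>
      <emma:arc from="3" to="4">austin</emma:arc>
    </emma:lattice>
  </emma:interpretation>
</emma:emma>"""

root = ET.fromstring(DOC)
lattice = root.find(f".//{{{EMMA_NS}}}lattice")
# Each arc becomes a (from, to, label) triple.
arcs = [(a.get("from"), a.get("to"), (a.text or "").strip())
        for a in lattice.findall(f"{{{EMMA_NS}}}arc")]
print(arcs[0])                                       # ('1', '2', 'flights')
print(lattice.get("initial"), lattice.get("final"))  # 1 8
```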
Alternatively, if we wish to represent the same information as an N-best list using emma:one-of, we would have the more verbose representation:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:one-of emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation>
      <text>flights to boston from portland today please</text>
    </emma:interpretation>
    <emma:interpretation id="interp2">
      <text>flights to boston from portland tomorrow</text>
    </emma:interpretation>
    <emma:interpretation>
      <text>flights to austin from portland today please</text>
    </emma:interpretation>
    <emma:interpretation>
      <text>flights to austin from portland tomorrow</text>
    </emma:interpretation>
    <emma:interpretation>
      <text>flights to boston from oakland today please</text>
    </emma:interpretation>
    <emma:interpretation>
      <text>flights to boston from oakland tomorrow</text>
    </emma:interpretation>
    <emma:interpretation>
      <text>flights to austin from oakland today please</text>
    </emma:interpretation>
    <emma:interpretation>
      <text>flights to austin from oakland tomorrow</text>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
The lattice representation avoids the need to enumerate all of the possible word sequences. Also, as detailed below, the emma:lattice representation enables placement of annotations on individual words in the input.
For use cases involving the representation of gesture/ink lattices and use cases involving lattices of semantic interpretations, EMMA allows for application namespace elements to appear within emma:arc.
For example, a sequence of two gestures, each of which is recognized as either a line or a circle, might be represented as follows:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation emma:medium="tactile" emma:mode="ink">
    <emma:lattice initial="1" final="3">
      <emma:arc from="1" to="2">
        <circle radius="100"/>
      </emma:arc>
      <emma:arc from="2" to="3">
        <line length="628"/>
      </emma:arc>
      <emma:arc from="1" to="2">
        <circle radius="200"/>
      </emma:arc>
      <emma:arc from="2" to="3">
        <line length="1256"/>
      </emma:arc>
    </emma:lattice>
  </emma:interpretation>
</emma:emma>
As an example of a lattice of semantic interpretations, in a travel application where the source is either "Boston" or "Austin" and the destination is either "Newark" or "New York", the possibilities might be represented in a lattice as follows:

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation emma:medium="acoustic" emma:mode="voice">
    <emma:lattice initial="1" final="3">
      <emma:arc from="1" to="2">
        <source city="boston"/>
      </emma:arc>
      <emma:arc from="2" to="3">
        <destination city="newark"/>
      </emma:arc>
      <emma:arc from="1" to="2">
        <source city="austin"/>
      </emma:arc>
      <emma:arc from="2" to="3">
        <destination city="new york"/>
      </emma:arc>
    </emma:lattice>
  </emma:interpretation>
</emma:emma>
The emma:arc element MAY contain either an application namespace element or character data. It MUST NOT contain combinations of application namespace elements and character data. However, an emma:info element MAY appear within an emma:arc element alongside character data, in order to allow for the association of vendor or application specific annotations on a single word or symbol in a lattice.

So, in summary, there are four groupings of content that can appear within emma:arc:
1. Character data alone, representing a word or other symbol label.
2. Character data together with an emma:info element providing vendor or application specific annotations that apply to the character data.
3. A single application namespace element alone.
4. A single application namespace element together with an emma:info element providing vendor or application specific annotations that apply to that element.

The encoding of lattice arcs as XML elements (emma:arc) enables arcs to be annotated with metadata such as timestamps, costs, or confidence scores:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation emma:medium="acoustic" emma:mode="voice">
    <emma:lattice initial="1" final="8">
      <emma:arc from="1" to="2" emma:start="1087995961542"
          emma:end="1087995962042" emma:cost="30">
        flights
      </emma:arc>
      <emma:arc from="2" to="3" emma:start="1087995962042"
          emma:end="1087995962542" emma:cost="20">
        to
      </emma:arc>
      <emma:arc from="3" to="4" emma:start="1087995962542"
          emma:end="1087995963042" emma:cost="50">
        boston
      </emma:arc>
      <emma:arc from="3" to="4" emma:start="1087995963042"
          emma:end="1087995963742" emma:cost="60">
        austin
      </emma:arc>
      ...
    </emma:lattice>
  </emma:interpretation>
</emma:emma>
The following EMMA attributes MAY be placed on emma:arc elements: absolute timestamps (emma:start, emma:end), relative timestamps (emma:offset-to-start, emma:duration), emma:confidence, emma:cost, the human language of the input (emma:lang), emma:medium, emma:mode, and emma:source. The use case for emma:medium, emma:mode, and emma:source is for lattices which contain content from different input modes. The emma:arc element MAY also contain an emma:info element for specification of vendor and application specific annotations on the arc.
The timestamps that appear on emma:arc elements do not necessarily indicate the start and end of the arc itself. They MAY indicate the start and end of the signal corresponding to the label on the arc. As a result, there is no requirement that the emma:end timestamp on an arc going into a node be equivalent to the emma:start of all arcs going out of that node. Furthermore, there is no guarantee that the left-to-right order of arcs in a lattice will correspond to the temporal order of the input signal. The lattice representation is an abstraction that represents a range of possible interpretations of a user's input and is not intended to necessarily be a representation of temporal order.
Costs are typically application and device dependent. There are a variety of ways that individual arc costs might be combined to produce costs for specific paths through the lattice. This specification does not standardize the way for these costs to be combined; it is up to the applications and devices to determine how such derived costs would be computed and used.
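For illustration only (the specification leaves cost combination to the application), one common convention treats a path's cost as the sum of its arc costs; a best-first search then yields the cheapest label sequence. The arc data below echoes the costs in the example above, and the best_path helper is invented for the sketch:

```python
import heapq

# Arc data with the costs from the example:
# (from_node, to_node, label, cost).
ARCS = [
    (1, 2, "flights", 30), (2, 3, "to", 20),
    (3, 4, "boston", 50), (3, 4, "austin", 60),
]

def best_path(arcs, initial, final):
    """Best-first search for the minimum-cost label sequence, under the
    non-normative convention that path cost = sum of arc costs."""
    succ = {}
    for f, t, lab, c in arcs:
        succ.setdefault(f, []).append((t, lab, c))
    heap = [(0, initial, [])]
    while heap:
        cost, node, labels = heapq.heappop(heap)
        if node == final:
            return cost, labels
        for t, lab, c in succ.get(node, []):
            heapq.heappush(heap, (cost + c, t, labels + [lab]))
    return None

print(best_path(ARCS, 1, 4))  # (100, ['flights', 'to', 'boston'])
```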
For some lattice formats, it is also desirable to annotate the nodes in the lattice themselves with information such as costs. For example, in speech recognition, costs might be placed on nodes as a result of word penalties or redistribution of costs. For this purpose EMMA also provides an emma:node element which can host annotations such as emma:cost. The emma:node element MUST have an attribute node-number which indicates the number of the node. There MUST be at most one emma:node specification for a given numbered node in the lattice. In our example, if there was a cost of 100 on the final state, this could be represented as follows:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation emma:medium="acoustic" emma:mode="voice">
    <emma:lattice initial="1" final="8">
      <emma:arc from="1" to="2" emma:start="1087995961542"
          emma:end="1087995962042" emma:cost="30">
        flights
      </emma:arc>
      <emma:arc from="2" to="3" emma:start="1087995962042"
          emma:end="1087995962542" emma:cost="20">
        to
      </emma:arc>
      <emma:arc from="3" to="4" emma:start="1087995962542"
          emma:end="1087995963042" emma:cost="50">
        boston
      </emma:arc>
      <emma:arc from="3" to="4" emma:start="1087995963042"
          emma:end="1087995963742" emma:cost="60">
        austin
      </emma:arc>
      ...
      <emma:node node-number="8" emma:cost="100"/>
    </emma:lattice>
  </emma:interpretation>
</emma:emma>
The relative timestamp mechanism in EMMA is intended to provide temporal information about arcs in a lattice in relative terms using offsets in milliseconds. In order to do this, the absolute time MAY be specified on emma:interpretation; both emma:time-ref-uri and emma:time-ref-anchor-point apply to emma:lattice and MAY be used there to set the anchor point for offsets to the start of the absolute time specified on emma:interpretation. The offset in milliseconds to the beginning of each arc MAY then be indicated on each emma:arc in the emma:offset-to-start attribute.
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="interp1"
      emma:start="1087995961542"
      emma:end="1087995963042"
      emma:medium="acoustic" emma:mode="voice">
    <emma:lattice emma:time-ref-uri="#interp1"
        emma:time-ref-anchor-point="start"
        initial="1" final="4">
      <emma:arc from="1" to="2" emma:offset-to-start="0">
        flights
      </emma:arc>
      <emma:arc from="2" to="3" emma:offset-to-start="500">
        to
      </emma:arc>
      <emma:arc from="3" to="4" emma:offset-to-start="1000">
        boston
      </emma:arc>
    </emma:lattice>
  </emma:interpretation>
</emma:emma>
Note that the offset for the first emma:arc MUST always be zero, since the EMMA attribute emma:offset-to-start indicates the number of milliseconds from the anchor point to the start of the piece of input associated with the emma:arc, in this case the word "flights".
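Resolving a relative timestamp is then a single addition: the absolute start of each arc is the anchor time plus its emma:offset-to-start value. A sketch using the numbers from the example (the dictionary layout is invented for illustration):

```python
# emma:start of the interpretation referenced by emma:time-ref-uri,
# which is the anchor because emma:time-ref-anchor-point="start".
anchor = 1087995961542

# emma:offset-to-start values (milliseconds) keyed by arc label.
offsets = {"flights": 0, "to": 500, "boston": 1000}

absolute = {word: anchor + ms for word, ms in offsets.items()}
print(absolute["flights"])  # 1087995961542 -- the first arc starts at the anchor
print(absolute["boston"])   # 1087995962542
```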
emma:literal element

Annotation | emma:literal |
---|---|
Definition | An element that contains string literal output. |
Children | String literal |
Attributes | None. |
Applies to | The emma:literal element is a child of emma:interpretation. |
Certain EMMA processing components produce semantic results in the form of string literals without any surrounding application namespace markup. These MUST be placed within the EMMA element emma:literal inside emma:interpretation. For example, if a semantic interpreter simply returned "boston", this could be represented in EMMA as:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="r1"
      emma:medium="acoustic" emma:mode="voice">
    <emma:literal>boston</emma:literal>
  </emma:interpretation>
</emma:emma>
Note that a raw recognition result of a sequence of words from speech recognition is also a kind of string literal and can be contained within emma:literal. For example, recognition of the string "flights to san francisco" can be represented in EMMA as follows:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="r1"
      emma:medium="acoustic" emma:mode="voice">
    <emma:literal>flights to san francisco</emma:literal>
  </emma:interpretation>
</emma:emma>
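Extracting such a literal is straightforward with a namespace-aware parser; a minimal sketch with Python's standard library:

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

DOC = """\
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="r1" emma:medium="acoustic" emma:mode="voice">
    <emma:literal>flights to san francisco</emma:literal>
  </emma:interpretation>
</emma:emma>"""

# Find the namespace-qualified emma:literal element and read its text.
literal = ET.fromstring(DOC).find(f".//{{{EMMA_NS}}}literal")
print(literal.text)  # flights to san francisco
```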
This section defines annotations in the EMMA namespace, including both attributes and elements. The values are specified in terms of the data types defined by XML Schema Part 2: Datatypes Second Edition [XML Schema Datatypes].
emma:model element

Annotation | emma:model |
---|---|
Definition | The emma:model element either references or provides inline the data model for the instance data. |
Children | If a ref attribute is not specified then this element contains the data model inline. |
Attributes |
|
Applies to | The emma:model element MAY appear only as a child of emma:emma. |
The data model that may be used to express constraints on the structure and content of instance data is specified as one of the annotations of the instance. Specifying the data model is OPTIONAL, in which case the data model can be said to be implicit. Typically the data model is pre-established by the application.
The data model is specified with the emma:model annotation defined as an element in the EMMA namespace. If the data model for the contents of an emma:interpretation, container element, or application namespace element is to be specified in EMMA, the attribute emma:model-ref MUST be specified on the emma:interpretation, container element, or application namespace element. Note that since multiple emma:model elements might be specified under emma:emma, it is possible to refer to multiple data models within a single EMMA document. For example, different alternative interpretations under an emma:one-of might have different data models. In this case, an emma:model-ref attribute would appear on each emma:interpretation element in the N-best list, with its value being the id of the emma:model element for that particular interpretation.
The data model is closely related to the interpretation data, and is typically specified as the annotation related to the emma:interpretation or emma:one-of elements.
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:model id="model1" ref="http://example.com/models/city.xml"/>
  <emma:interpretation emma:model-ref="model1"
      emma:medium="acoustic" emma:mode="voice">
    <city>London</city>
    <country>UK</country>
  </emma:interpretation>
</emma:emma>
The emma:model annotation MAY reference any element or attribute in the application instance data, as well as any EMMA container element (emma:one-of, emma:group, or emma:sequence).
The data model annotation MAY be used either to reference an external data model with the ref attribute or to provide a data model as in-line content. Either a ref attribute or an in-line data model (but not both) MUST be specified.
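That "exactly one of ref or in-line content" constraint is easy to check mechanically. A hedged sketch (the model_is_valid helper is invented, not part of EMMA):

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

def model_is_valid(model):
    """True when exactly one of {ref attribute, in-line content} is present."""
    has_ref = model.get("ref") is not None
    has_inline = len(model) > 0  # any child elements count as in-line content
    return has_ref != has_inline

by_ref = ET.fromstring(
    f'<emma:model xmlns:emma="{EMMA_NS}" id="model1" '
    'ref="http://example.com/models/city.xml"/>')
empty = ET.fromstring(f'<emma:model xmlns:emma="{EMMA_NS}" id="m2"/>')

print(model_is_valid(by_ref), model_is_valid(empty))  # True False
```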
emma:derived-from element and emma:derivation element

Annotation | emma:derived-from |
---|---|
Definition | An empty element which provides a reference to the interpretation from which the element it appears on was derived. |
Children | None |
Attributes |
|
Applies to | The emma:derived-from element is legal only as a child of emma:interpretation, emma:one-of, emma:group, or emma:sequence. |
Annotation | emma:derivation |
Definition | An element which contains interpretation and container elements representing earlier stages in the processing of the input. |
Children | One or more emma:interpretation, emma:one-of, emma:sequence, or emma:group elements. |
Attributes | None |
Applies to | The emma:derivation element MAY appear only as a child of the emma:emma element. |
Instances of interpretations are in general derived from other instances of interpretation in a process that goes from raw data to increasingly refined representations of the input. The derivation annotation is used to link any two interpretations that are related by representing the source and the outcome of an interpretation process. For instance, a speech recognition process can return the following result in the form of raw text:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation
      emma:medium="acoustic" emma:mode="voice">
    <answer>From Boston to Denver tomorrow</answer>
  </emma:interpretation>
</emma:emma>
A first interpretation process will produce:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation
      emma:medium="acoustic" emma:mode="voice">
    <origin>Boston</origin>
    <destination>Denver</destination>
    <date>tomorrow</date>
  </emma:interpretation>
</emma:emma>
A second interpretation process, aware of the current date, will be able to produce a more refined instance, such as:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation emma:medium="acoustic" emma:mode="voice">
    <origin>Boston</origin>
    <destination>Denver</destination>
    <date>20030315</date>
  </emma:interpretation>
</emma:emma>
The interaction manager might need to have access to the three levels of interpretation. The emma:derived-from annotation element can be used to establish a chain of derivation relationships, as in the following example:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:derivation>
    <emma:interpretation id="raw"
        emma:medium="acoustic" emma:mode="voice">
      <answer>From Boston to Denver tomorrow</answer>
    </emma:interpretation>
    <emma:interpretation id="better">
      <emma:derived-from resource="#raw" composite="false"/>
      <origin>Boston</origin>
      <destination>Denver</destination>
      <date>tomorrow</date>
    </emma:interpretation>
  </emma:derivation>
  <emma:interpretation>
    <emma:derived-from resource="#better" composite="false"/>
    <origin>Boston</origin>
    <destination>Denver</destination>
    <date>20030315</date>
  </emma:interpretation>
</emma:emma>
The emma:derivation element MAY be used as a container for representations of the earlier stages in the interpretation of the input. The latest stage of processing MUST be a direct child of emma:emma.
The resource attribute on emma:derived-from is a URI which can reference IDs in the current or other EMMA documents.
In addition to representing sequential derivations, the EMMA emma:derived-from element can also be used to capture composite derivations. Composite derivations involve the combination of inputs from different modes.
In order to indicate whether an emma:derived-from element describes a sequential derivation step or a composite derivation step, the emma:derived-from element has an attribute composite which has a boolean value. A composite emma:derived-from MUST be marked as composite="true", while a sequential emma:derived-from element is marked as composite="false". If this attribute is not specified, the value is false by default.
In the following composite derivation example, the user said "destination" using the voice mode and circled Boston on a map using the ink mode:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:derivation>
    <emma:interpretation id="voice1"
        emma:start="1087995961500"
        emma:end="1087995962542"
        emma:process="http://example.com/myasr.xml"
        emma:source="http://example.com/microphone/NC-61"
        emma:signal="http://example.com/signals/sg23.wav"
        emma:confidence="0.6"
        emma:medium="acoustic" emma:mode="voice"
        emma:function="dialog" emma:verbal="true"
        emma:lang="en-US" emma:tokens="destination">
      <rawinput>destination</rawinput>
    </emma:interpretation>
    <emma:interpretation id="ink1"
        emma:start="1087995961600"
        emma:end="1087995964000"
        emma:process="http://example.com/mygesturereco.xml"
        emma:source="http://example.com/pen/wacom123"
        emma:signal="http://example.com/signals/ink5.inkml"
        emma:confidence="0.5"
        emma:medium="tactile" emma:mode="ink"
        emma:function="dialog" emma:verbal="false">
      <rawinput>Boston</rawinput>
    </emma:interpretation>
  </emma:derivation>
  <emma:interpretation emma:confidence="0.3"
      emma:start="1087995961500"
      emma:end="1087995964000"
      emma:medium="acoustic tactile" emma:mode="voice ink"
      emma:function="dialog" emma:verbal="true"
      emma:lang="en-US" emma:tokens="destination">
    <emma:derived-from resource="#voice1" composite="true"/>
    <emma:derived-from resource="#ink1" composite="true"/>
    <destination>Boston</destination>
  </emma:interpretation>
</emma:emma>
In this example, annotations on the multimodal interpretation indicate the process used for the integration, and there are two emma:derived-from elements, one pointing to the speech and one pointing to the pen gesture.
The only constraints the EMMA specification places on the annotations that appear on a composite input are that the emma:medium attribute MUST contain the union of the emma:medium attributes on the combining inputs, represented as a space delimited set of nmtokens as defined in Section 4.2.11, and that the emma:mode attribute MUST contain the union of the emma:mode attributes on the combining inputs, represented as a space delimited set of nmtokens as defined in Section 4.2.11. In the example above this means that the emma:medium value is "acoustic tactile" and the emma:mode attribute is "voice ink". How all other annotations are handled is author defined. In the following paragraph, informative examples of how specific annotations might be handled are given.
With reference to the illustrative example above, this paragraph provides informative guidance regarding the determination of annotations (beyond emma:medium and emma:mode) on a composite multimodal interpretation. Generally the timestamp on a combined input should contain the intervals indicated by the combining inputs. For the absolute timestamps emma:start and emma:end this can be achieved by taking the earlier of the emma:start values (emma:start="1087995961500" in our example) and the later of the emma:end values (emma:end="1087995964000" in the example). The determination of relative timestamps for composite inputs is more complex; informative guidance is given in Section 4.2.10.4. Generally speaking, the emma:confidence value will be some numerical combination of the confidence scores assigned to the combining inputs. In our example, it is the result of multiplying the voice and ink confidence scores (0.3). In other cases there may not be a confidence score for one of the combining inputs, and the author may choose to copy the confidence score from the input which does have one. Generally, for emma:verbal, if either of the inputs has the value true then the multimodal interpretation will also be emma:verbal="true", as in the example. In other words, the annotation for the composite input is the result of an inclusive OR of the boolean values of the annotations on the inputs. If an annotation is only specified on one of the combining inputs then it may in some cases be assumed to apply to the multimodal interpretation of the composite input. In the example, emma:lang="en-US" is only specified for the speech input, and this annotation appears on the composite result also. Similarly in our example, only the voice input has emma:tokens, and the author has chosen to annotate the combined input with the same emma:tokens value. In this example, the emma:function is the same on both combining inputs, and the author has chosen to use the same annotation on the composite interpretation.
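The conventions above can be summarized in a small sketch. Only the medium/mode unions are normative; the remaining rules (min/max timestamps, multiplied confidence, OR of verbal) are the informative, author-defined conventions the text describes, and the composite_annotations helper itself is invented:

```python
def composite_annotations(a, b):
    """Combine annotations from two unimodal inputs. Only the medium/mode
    unions are required by the spec; the rest follow the informative
    conventions described in the text."""
    def union(x, y):
        # Space-delimited union of nmtokens, preserving first-seen order.
        return " ".join(dict.fromkeys((x + " " + y).split()))
    return {
        "emma:medium": union(a["emma:medium"], b["emma:medium"]),
        "emma:mode": union(a["emma:mode"], b["emma:mode"]),
        "emma:start": min(a["emma:start"], b["emma:start"]),
        "emma:end": max(a["emma:end"], b["emma:end"]),
        # round() merely hides float noise in the illustrative product
        "emma:confidence": round(a["emma:confidence"] * b["emma:confidence"], 6),
        "emma:verbal": a["emma:verbal"] or b["emma:verbal"],
    }

voice = {"emma:medium": "acoustic", "emma:mode": "voice",
         "emma:start": 1087995961500, "emma:end": 1087995962542,
         "emma:confidence": 0.6, "emma:verbal": True}
ink = {"emma:medium": "tactile", "emma:mode": "ink",
       "emma:start": 1087995961600, "emma:end": 1087995964000,
       "emma:confidence": 0.5, "emma:verbal": False}

combined = composite_annotations(voice, ink)
print(combined["emma:medium"], combined["emma:mode"], combined["emma:confidence"])
# acoustic tactile voice ink 0.3
```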
In annotating derivations of the processing of the input, EMMA provides the flexibility of both coarse-grained and fine-grained annotation of relations among interpretations. For example, when relating two N-best lists within emma:one-of elements, either there can be a single emma:derived-from element under emma:one-of referring to the ID of the emma:one-of for the earlier processing stage:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:derivation>
    <emma:one-of id="nbest1" emma:medium="acoustic" emma:mode="voice">
      <emma:interpretation>
        <res>from boston to denver on march eleven two thousand three</res>
      </emma:interpretation>
      <emma:interpretation>
        <res>from austin to denver on march eleven two thousand three</res>
      </emma:interpretation>
    </emma:one-of>
  </emma:derivation>
  <emma:one-of>
    <emma:derived-from resource="#nbest1" composite="false"/>
    <emma:interpretation>
      <origin>Boston</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>
    <emma:interpretation>
      <origin>Austin</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
Or there can be a separate emma:derived-from element on each emma:interpretation element, referring to the specific emma:interpretation element it was derived from.
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:one-of>
    <emma:interpretation>
      <emma:derived-from resource="#int1" composite="false"/>
      <origin>Boston</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>
    <emma:interpretation>
      <emma:derived-from resource="#int2" composite="false"/>
      <origin>Austin</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>
  </emma:one-of>
  <emma:derivation>
    <emma:one-of emma:medium="acoustic" emma:mode="voice">
      <emma:interpretation id="int1">
        <res>from boston to denver on march eleven two thousand three</res>
      </emma:interpretation>
      <emma:interpretation id="int2">
        <res>from austin to denver on march eleven two thousand three</res>
      </emma:interpretation>
    </emma:one-of>
  </emma:derivation>
</emma:emma>
Section 4.3 provides further examples of the use of emma:derived-from to represent sequential derivations and addresses the issue of the scope of EMMA annotations across derivations of user input.
emma:grammar element

Annotation | emma:grammar |
---|---|
Definition | An element used to provide a reference to the grammar used in processing the input. |
Children | None |
Attributes | |
Applies to | The emma:grammar element is legal only as a child of the emma:emma element. |
The grammar that was used to derive the EMMA result MAY be specified with the emma:grammar annotation, defined as an element in the EMMA namespace.
Example:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:grammar id="gram1" ref="someURI"/>
  <emma:grammar id="gram2" ref="anotherURI"/>
  <emma:one-of emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation emma:grammar-ref="gram1">
      <origin>Boston</origin>
    </emma:interpretation>
    <emma:interpretation emma:grammar-ref="gram1">
      <origin>Austin</origin>
    </emma:interpretation>
    <emma:interpretation emma:grammar-ref="gram2">
      <command>help</command>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
The emma:grammar annotation is a child of emma:emma.
emma:info element

Annotation | emma:info |
---|---|
Definition | The emma:info element acts as a container for vendor and/or application specific metadata regarding a user's input. |
Children | One or more elements in the application namespace providing metadata about the input. |
Attributes | |
Applies to | The emma:info element is legal only as a child of the EMMA elements emma:emma, emma:interpretation, emma:group, emma:one-of, emma:sequence, emma:arc, or emma:node. |
In Section 4.2, a series of attributes are defined for representation of metadata about user inputs in a standardized form. EMMA also provides an extensibility mechanism for annotation of user inputs with vendor or application specific metadata not covered by the standard set of EMMA annotations. The element emma:info MUST be used as a container for these annotations, unless they are explicitly covered by emma:endpoint-info. For example, if an input to a dialog system needed to be annotated with the number that the call originated from, the caller's state, some indication of the type of customer, and the name of the service, these pieces of information could be represented within emma:info as in the following example:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:info>
    <caller_id>
      <phone_number>2121234567</phone_number>
      <state>NY</state>
    </caller_id>
    <customer_type>residential</customer_type>
    <service_name>acme_travel_service</service_name>
  </emma:info>
  <emma:one-of emma:start="1087995961542" emma:end="1087995963542"
      emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation emma:confidence="0.75">
      <origin>Boston</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>
    <emma:interpretation emma:confidence="0.68">
      <origin>Austin</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
It is important to have an EMMA container element for application/vendor specific annotations, since EMMA elements provide a structure for representation of multiple possible interpretations of the input. As a result, it is cumbersome to state application/vendor specific metadata as part of the application data within each emma:interpretation. An element is used rather than an attribute so that internal structure can be given to the annotations within emma:info.
In addition to emma:emma, emma:info MAY also appear as a child of other structural elements such as emma:interpretation, emma:one-of, and so on. When emma:info appears as a child of one of these elements, the application/vendor specific annotations contained within emma:info are assumed to apply to all of the emma:interpretation elements within the containing element. The semantics of conflicting annotations in emma:info, for example when different values are found within emma:emma and emma:interpretation, are left to the developer of the vendor/application specific annotations.
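As an illustrative sketch of this scoping (the caller_id annotation and its placement are assumptions, not taken from the specification), an emma:info placed on emma:one-of applies to both contained interpretations:

```xml
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns="http://www.example.com/example">
  <emma:one-of emma:medium="acoustic" emma:mode="voice">
    <!-- this emma:info applies to both emma:interpretation
         elements within the containing emma:one-of -->
    <emma:info>
      <caller_id>2121234567</caller_id>
    </emma:info>
    <emma:interpretation>
      <origin>Boston</origin>
    </emma:interpretation>
    <emma:interpretation>
      <origin>Austin</origin>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
```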
emma:endpoint-info element and emma:endpoint element

Annotation | emma:endpoint-info |
---|---|
Definition | The emma:endpoint-info element acts as a container for all application specific annotation regarding the communication environment. |
Children | One or more emma:endpoint elements. |
Attributes | |
Applies to | The emma:endpoint-info element is legal only as a child of emma:emma. |

Annotation | emma:endpoint |
---|---|
Definition | The element acts as a container for application specific endpoint information. |
Children | Elements in the application namespace providing metadata about the input. |
Attributes | |
Applies to | emma:endpoint-info |
In order to conduct multimodal interaction, there is a need in EMMA to specify the properties of the endpoint that receives the input which leads to the EMMA annotation. This allows subsequent components to utilize the endpoint properties as well as the annotated inputs to conduct meaningful multimodal interaction. The EMMA element emma:endpoint can be used for this purpose. It can specify the endpoint properties based on a set of common endpoint property attributes in EMMA, such as emma:endpoint-address, emma:port-num, emma:port-type, etc. (Section 4.2.14). Moreover, it provides an extensible annotation structure that allows the inclusion of application and vendor specific endpoint properties.
Note that the usage of the term "endpoint" in this context isdifferent from the way that the term is used in speech processing,where it refers to the end of a speech input. As used here,"endpoint" refers to a network location which is the source orrecipient of an EMMA document.
In multimodal interaction, multiple devices can be used and each device can open multiple communication endpoints at the same time. These endpoints are used to transmit and receive data, such as raw input, EMMA documents, etc. The EMMA element emma:endpoint provides a generic representation of endpoint information which is relevant to multimodal interaction. It allows the annotation to be interoperable, and it eliminates the need for EMMA processors to create their own specialized annotations for existing protocols, potential protocols, or yet undefined private protocols that they may use.
Moreover, emma:endpoint-info provides a container to hold all annotations regarding the endpoint information, including emma:endpoint and other application and vendor specific annotations that are related to the communication, allowing the same communication environment to be referenced and used in multiple interpretations.
Note that EMMA provides two locations (i.e. emma:info and emma:endpoint-info) for specifying vendor/application specific annotations. If the annotation is specifically related to the description of the endpoint, then the vendor/application specific annotation SHOULD be placed within emma:endpoint-info; otherwise it SHOULD be placed within emma:info.
The following example illustrates the annotation of endpoint reference properties in EMMA.
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example"
    xmlns:ex="http://www.example.com/emma/port">
  <emma:endpoint-info id="audio-channel-1">
    <emma:endpoint id="endpoint1"
        emma:endpoint-role="sink"
        emma:endpoint-address="135.61.71.103"
        emma:port-num="50204"
        emma:port-type="rtp"
        emma:endpoint-pair-ref="endpoint2"
        emma:media-type="audio/dsr-es202212; rate:8000; maxptime:40"
        emma:service-name="travel"
        emma:mode="voice">
      <ex:app-protocol>SIP</ex:app-protocol>
    </emma:endpoint>
    <emma:endpoint id="endpoint2"
        emma:endpoint-role="source"
        emma:endpoint-address="136.62.72.104"
        emma:port-num="50204"
        emma:port-type="rtp"
        emma:endpoint-pair-ref="endpoint1"
        emma:media-type="audio/dsr-es202212; rate:8000; maxptime:40"
        emma:service-name="travel"
        emma:mode="voice">
      <ex:app-protocol>SIP</ex:app-protocol>
    </emma:endpoint>
  </emma:endpoint-info>
  <emma:interpretation emma:start="1087995961542"
      emma:end="1087995963542"
      emma:endpoint-info-ref="audio-channel-1"
      emma:medium="acoustic" emma:mode="voice">
    <destination>Chicago</destination>
  </emma:interpretation>
</emma:emma>
The ex:app-protocol element is provided by the application or the vendor specification. It specifies that the application layer protocol used to establish the speech transmission from the "source" port to the "sink" port is the Session Initiation Protocol (SIP). This is specific to SIP based VoIP communication, in which the actual media transmission and the call signaling that controls the communication sessions are separated and typically based on different protocols. In the above example, the Real-time Transport Protocol (RTP) is used in the media transmission between the source port and the sink port.
emma:tokens attribute

Annotation | emma:tokens |
---|---|
Definition | An attribute of type xsd:string holding a sequence of input tokens. |
Applies to | emma:interpretation, emma:group, emma:one-of, emma:sequence, and application instance data. |
The emma:tokens annotation holds a list of input tokens. In the following description, the term tokens is used in the computational and syntactic sense of units of input, and not in the sense of XML tokens. The value held in emma:tokens is the list of the tokens of input as produced by the processor which generated the EMMA document; there is no language associated with this value.
In the case where a grammar is used to constrain input, the value will correspond to tokens as defined by the grammar. So for an EMMA document produced by input to a SRGS grammar [SRGS], the value of emma:tokens will be the list of words and/or phrases that are defined as tokens in SRGS (see Section 2.1 of [SRGS]). Items in the emma:tokens list are delimited by white space and/or quotation marks for phrases containing white space. For example:
emma:tokens="arriving at 'Liverpool Street'"
where the three tokens of input are arriving, at, and Liverpool Street.
The emma:tokens annotation MAY be applied not just to the lexical words and phrases of language but to any level of input processing. Other examples of tokenization include phonemes, ink strokes, gestures, and any other discrete units of input at any level.
Examples:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation
      emma:tokens="From Cambridge to London tomorrow"
      emma:medium="acoustic" emma:mode="voice">
    <origin emma:tokens="From Cambridge">Cambridge</origin>
    <destination emma:tokens="to London">London</destination>
    <date emma:tokens="tomorrow">20030315</date>
  </emma:interpretation>
</emma:emma>
emma:process attribute

Annotation | emma:process |
---|---|
Definition | An attribute of type xsd:anyURI referencing the process used to generate the interpretation. |
Applies to | emma:interpretation, emma:one-of, emma:group, emma:sequence |
A reference to the information concerning the processing that was used for generating an interpretation MAY be made using the emma:process attribute. For example:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:derivation>
    <emma:interpretation id="raw"
        emma:medium="acoustic" emma:mode="voice">
      <answer>From Boston to Denver tomorrow</answer>
    </emma:interpretation>
    <emma:interpretation id="better"
        emma:process="http://example.com/mysemproc1.xml">
      <origin>Boston</origin>
      <destination>Denver</destination>
      <date>tomorrow</date>
      <emma:derived-from resource="#raw"/>
    </emma:interpretation>
  </emma:derivation>
  <emma:interpretation
      emma:process="http://example.com/mysemproc2.xml">
    <origin>Boston</origin>
    <destination>Denver</destination>
    <date>03152003</date>
    <emma:derived-from resource="#better"/>
  </emma:interpretation>
</emma:emma>
The process description document referenced by the emma:process annotation MAY include information on the process itself, such as grammar, type of parser, etc. EMMA is not normative about the format of the process description document.
emma:no-input attribute

Annotation | emma:no-input |
---|---|
Definition | Attribute holding an xsd:boolean value that is true if there was no input. |
Applies to | emma:interpretation |
The case of lack of input MUST be annotated as follows:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation emma:no-input="true"
      emma:medium="acoustic" emma:mode="voice"/>
</emma:emma>
If the emma:interpretation element is annotated with emma:no-input="true", then the emma:interpretation MUST be empty.
emma:uninterpreted attribute

Annotation | emma:uninterpreted |
---|---|
Definition | Attribute holding an xsd:boolean value that is true if no interpretation was produced in response to the input. |
Applies to | emma:interpretation |
An emma:interpretation element representing input for which no interpretation was produced MUST be annotated with emma:uninterpreted="true". For example:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation emma:uninterpreted="true"
      emma:medium="acoustic" emma:mode="voice"/>
</emma:emma>
The notation for uninterpreted input MAY refer to any possible stage of interpretation processing, including raw transcriptions. For instance, no interpretation would be produced for stages performing pure signal capture such as audio recordings. Likewise, if a spoken input was recognized but cannot be parsed by a language understanding component, it can be tagged as emma:uninterpreted as in the following example:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation
      emma:process="http://example.com/mynlu.xml"
      emma:uninterpreted="true"
      emma:tokens="From Cambridge to London tomorrow"
      emma:medium="acoustic" emma:mode="voice"/>
</emma:emma>
The emma:interpretation element MUST be empty if it is annotated with emma:uninterpreted="true".
emma:lang attribute

Annotation | emma:lang |
---|---|
Definition | An attribute of type xsd:language indicating the language for the input. |
Applies to | emma:interpretation, emma:group, emma:one-of, emma:sequence, and application instance data. |
The emma:lang annotation is used to indicate the human language for the input that it annotates. The values of the emma:lang attribute are language identifiers as defined by IETF Best Current Practice 47 [BCP47]. For example, emma:lang="fr" denotes French, and emma:lang="en-US" denotes US English. emma:lang MAY be applied to any emma:interpretation element. Its annotative scope follows the annotative scope of these elements. Unlike the xml:lang attribute in XML, emma:lang does not specify the language used by element contents or attribute values.
The following example shows the use of emma:lang for annotating an input interpretation.
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation emma:lang="fr"
      emma:medium="acoustic" emma:mode="voice">
    <answer>arretez</answer>
  </emma:interpretation>
</emma:emma>
Many kinds of input, including some inputs made through pen, computer vision, and other kinds of sensors, are inherently non-linguistic. Examples include drawing areas or arrows using a pen, and music input for tune recognition. If these non-linguistic inputs are annotated with emma:lang then they MUST be annotated as emma:lang="zxx". For example, pen input where a user circles an area on a map display could be represented as follows, where emma:lang="zxx" indicates that the ink input is not in any human language.
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation emma:medium="tactile" emma:mode="ink"
      emma:lang="zxx">
    <location>
      <type>area</type>
      <points>42.1345 -37.128 42.1346 -37.120 ... </points>
    </location>
  </emma:interpretation>
</emma:emma>
If inputs for which there is no information about whether the source input is in a particular human language (and, if so, which language) are annotated with emma:lang, then they MUST be annotated as emma:lang="". Furthermore, in cases where there is no explicit emma:lang annotation, and none is inherited from a higher element in the document, the default value for emma:lang is "", meaning that there is no information about whether the source input is in a language and, if so, which language.
The xml:lang and emma:lang attributes serve distinctly different and equally important purposes. The role of the xml:lang attribute in XML 1.0 is to indicate the language used for character data content in an XML element or document. In contrast, the emma:lang attribute is used to indicate the language employed by a user when entering an input. Critically, emma:lang annotates the language of the signal originating from the user rather than the specific tokens used at a particular stage of processing. This is most clearly illustrated by an example involving multiple stages of processing of a user input. Consider the following scenario: EMMA is being used to represent three stages in the processing of a spoken input to a system for ordering products. The user input is in Italian; after speech recognition, the user input is first translated into English, then a natural language understanding system converts the English translation into a product ID (which is not in any particular language). Since the input signal is a user speaking Italian, the annotation will be emma:lang="it" on all three of these stages of processing. The xml:lang attribute, in contrast, will initially be "it"; after translation the xml:lang will be "en-US"; and after language understanding it will be "zxx", since the product ID is non-linguistic content. The following are examples of EMMA documents corresponding to these three processing stages, abbreviated to show the critical attributes for discussion here. Note that <transcription>, <translation>, and <understanding> are application namespace elements, not part of the EMMA markup.
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation emma:lang="it"
      emma:mode="voice" emma:medium="acoustic">
    <transcription xml:lang="it">condizionatore</transcription>
  </emma:interpretation>
</emma:emma>
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation emma:lang="it"
      emma:mode="voice" emma:medium="acoustic">
    <translation xml:lang="en-US">air conditioner</translation>
  </emma:interpretation>
</emma:emma>
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation emma:lang="it"
      emma:mode="voice" emma:medium="acoustic">
    <understanding xml:lang="zxx">id1456</understanding>
  </emma:interpretation>
</emma:emma>
In order to handle inputs involving multiple languages, such as through code switching, the emma:lang attribute MAY contain several language identifiers separated by spaces.
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation
      emma:tokens="please stop arretez s'il vous plait"
      emma:lang="en fr"
      emma:medium="acoustic" emma:mode="voice">
    <command>CANCEL</command>
  </emma:interpretation>
</emma:emma>
emma:signal and emma:signal-size attributes

Annotation | emma:signal |
---|---|
Definition | An attribute of type xsd:anyURI referencing the input signal. |
Applies to | emma:interpretation, emma:one-of, emma:group, emma:sequence, and application instance data. |

Annotation | emma:signal-size |
---|---|
Definition | An attribute of type xsd:nonNegativeInteger specifying the size in eight-bit octets of the referenced source. |
Applies to | emma:interpretation, emma:one-of, emma:group, emma:sequence, and application instance data. |
A URI reference to the signal that originated the input recognition process MAY be represented in EMMA using the emma:signal annotation.
Here is an example where the reference to a speech signal is represented using the emma:signal annotation on the emma:interpretation element:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation
      emma:signal="http://example.com/signals/sg23.bin"
      emma:medium="acoustic" emma:mode="voice">
    <origin>Boston</origin>
    <destination>Denver</destination>
    <date>03152003</date>
  </emma:interpretation>
</emma:emma>
The emma:signal-size annotation can be used to declare the exact size of the associated signal in 8-bit octets. An example of the use of an EMMA document to represent a recording, with emma:signal-size indicating the size, is as follows:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation emma:medium="acoustic" emma:mode="voice"
      emma:function="recording"
      emma:uninterpreted="true"
      emma:signal="http://example.com/signals/recording.mpg"
      emma:signal-size="82102"
      emma:duration="10000"/>
</emma:emma>
emma:media-type attribute

Annotation | emma:media-type |
---|---|
Definition | An attribute of type xsd:string holding the MIME type associated with the signal's data format. |
Applies to | emma:interpretation, emma:one-of, emma:group, emma:sequence, emma:endpoint, and application instance data. |
The data format of the signal that originated the input MAY be represented in EMMA using the emma:media-type annotation. An initial set of MIME media types is defined by [RFC2046].
Here is an example where the media type for the ETSI ES 202 212 audio codec for Distributed Speech Recognition (DSR) is applied to the emma:interpretation element. The example also specifies an optional sampling rate of 8 kHz and a maxptime of 40 milliseconds.
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation
      emma:signal="http://example.com/signals/signal.dsr"
      emma:media-type="audio/dsr-es202212; rate:8000; maxptime:40"
      emma:medium="acoustic" emma:mode="voice">
    <origin>Boston</origin>
    <destination>Denver</destination>
    <date>03152003</date>
  </emma:interpretation>
</emma:emma>
emma:confidence attribute

Annotation | emma:confidence |
---|---|
Definition | An attribute of type xsd:decimal in range 0.0 to 1.0, indicating the processor's confidence in the result. |
Applies to | emma:interpretation, emma:one-of, emma:group, emma:sequence, and application instance data. |
The confidence score in EMMA is used to indicate the quality of the input, and if confidence is annotated on an input it MUST be given as the value of emma:confidence. The confidence score MUST be a number in the range from 0.0 to 1.0 inclusive. A value of 0.0 indicates minimum confidence, and a value of 1.0 indicates maximum confidence. Note that emma:confidence represents not only the confidence of the speech recognizer, but rather the confidence of whatever processor was responsible for creating the EMMA result, based on whatever evidence it has. For a natural language interpretation, for example, this might include semantic heuristics in addition to speech recognition scores. Moreover, the confidence score values do not have to be interpreted as probabilities. In fact, confidence score values are platform-dependent, since their computation is likely to differ between platforms and different EMMA processors. Confidence scores are annotated explicitly in EMMA in order to provide this information to subsequent processes for multimodal interaction. The example below illustrates how confidence scores are annotated in EMMA.
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:one-of emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation emma:confidence="0.6">
      <location>Boston</location>
    </emma:interpretation>
    <emma:interpretation emma:confidence="0.4">
      <location>Austin</location>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
In addition to its use as an attribute on the EMMA interpretation and container elements, the emma:confidence attribute MAY also be used to assign confidences to elements in instance data in the application namespace. This can be seen in the following example, where the <destination> and <origin> elements have confidences.
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation emma:confidence="0.6"
      emma:medium="acoustic" emma:mode="voice">
    <destination emma:confidence="0.8">Boston</destination>
    <origin emma:confidence="0.6">Austin</origin>
  </emma:interpretation>
</emma:emma>
Although in general instance data can be represented in XMLusing a combination of elements and attributes in the applicationnamespace, EMMA does not provide a standard way to annotateprocessors' confidences in attributes. Consequently, instance datathat is expected to be assigned confidences SHOULD be representedusing elements, as in the above example.
emma:source attribute

Annotation | emma:source |
---|---|
Definition | An attribute of type xsd:anyURI referencing the source of input. |
Applies to | emma:interpretation, emma:one-of, emma:group, emma:sequence, and application instance data. |
The source of an interpreted input MAY be represented in EMMA as a URI resource using the emma:source annotation.
Here is an example that shows different input sources fordifferent input interpretations.
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example"
    xmlns:myapp="http://www.example.com/myapp">
  <emma:one-of emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation
        emma:source="http://example.com/microphone/NC-61">
      <myapp:destination>Boston</myapp:destination>
    </emma:interpretation>
    <emma:interpretation
        emma:source="http://example.com/microphone/NC-4024">
      <myapp:destination>Austin</myapp:destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
The start and end times for input MAY be indicated using either absolute timestamps or relative timestamps. Both are in milliseconds for ease in processing timestamps. Note that the ECMAScript Date object's getTime() function is a convenient way to determine the absolute time.
emma:start, emma:end attributes

Annotation | emma:start, emma:end |
---|---|
Definition | Attributes of type xsd:nonNegativeInteger indicating the absolute starting and ending times of an input in terms of the number of milliseconds since 1 January 1970 00:00:00 GMT. |
Applies to | emma:interpretation, emma:group, emma:one-of, emma:sequence, emma:arc, and application instance data |
Here is an example of a timestamp for an absolute time.
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation emma:start="1087995961542"
      emma:end="1087995963542"
      emma:medium="acoustic" emma:mode="voice">
    <destination>Chicago</destination>
  </emma:interpretation>
</emma:emma>
The emma:start and emma:end annotations on an input MAY be identical; however, the emma:end value MUST NOT be less than the emma:start value.
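As an illustrative sketch (the tap gesture and its timestamp values are assumptions, not taken from the specification), a momentary input can carry identical emma:start and emma:end values:

```xml
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns="http://www.example.com/example">
  <!-- a momentary tap: emma:start equals emma:end,
       which the specification permits -->
  <emma:interpretation emma:start="1087995961542"
      emma:end="1087995961542"
      emma:medium="tactile" emma:mode="ink">
    <destination>Chicago</destination>
  </emma:interpretation>
</emma:emma>
```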
emma:time-ref-uri, emma:time-ref-anchor-point, emma:offset-to-start attributes

Annotation | emma:time-ref-uri |
---|---|
Definition | Attribute of type xsd:anyURI indicating the URI used to anchor the relative timestamp. |
Applies to | emma:interpretation, emma:group, emma:one-of, emma:sequence, emma:lattice, and application instance data |

Annotation | emma:time-ref-anchor-point |
---|---|
Definition | Attribute with a value of start or end, defaulting to start. It indicates whether to measure the time from the start or end of the interval designated with emma:time-ref-uri. |
Applies to | emma:interpretation, emma:group, emma:one-of, emma:sequence, emma:lattice, and application instance data |

Annotation | emma:offset-to-start |
---|---|
Definition | Attribute of type xsd:integer, defaulting to zero. It specifies the offset in milliseconds for the start of input from the anchor point designated with emma:time-ref-uri and emma:time-ref-anchor-point. |
Applies to | emma:interpretation, emma:group, emma:one-of, emma:sequence, emma:arc, and application instance data |
Relative timestamps define the start of an input relative to thestart or end of a reference interval such as another input.
The reference interval is designated with the emma:time-ref-uri attribute. This MAY be combined with the emma:time-ref-anchor-point attribute to specify whether the anchor point is the start or end of this interval. The start of an input relative to this anchor point is then specified with the emma:offset-to-start attribute.
Here is an example where the referenced input is in the samedocument:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:sequence>
    <emma:interpretation id="int1"
        emma:medium="acoustic" emma:mode="voice">
      <origin>Denver</origin>
    </emma:interpretation>
    <emma:interpretation
        emma:medium="acoustic" emma:mode="voice"
        emma:time-ref-uri="#int1"
        emma:time-ref-anchor-point="start"
        emma:offset-to-start="5000">
      <destination>Chicago</destination>
    </emma:interpretation>
  </emma:sequence>
</emma:emma>
Note that the reference point refers to an input, but not necessarily to a complete input. For example, if a speech recognizer timestamps each word in an utterance, the anchor point might refer to the timestamp for just one word.
The absolute and relative timestamps are not mutually exclusive; that is, it is possible to have both relative and absolute timestamp attributes on the same EMMA container element.
Timestamps of inputs collected by different devices will be subject to variation if the times maintained by the devices are not synchronized. This concern is outside of the scope of the EMMA specification.
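To make the resolution of a relative timestamp concrete, the following sketch (illustrative only, not part of the specification) resolves a same-document emma:time-ref-uri reference to an absolute start time, using Python's standard ElementTree; the sample element names and values are hypothetical:

```python
import xml.etree.ElementTree as ET

EMMA = "{http://www.w3.org/2003/04/emma}"  # Clark-notation namespace prefix

def absolute_start(elem, doc_root):
    """Resolve a relative timestamp to an absolute start time (ms).

    Assumes emma:time-ref-uri is a same-document fragment reference to an
    element carrying absolute emma:start/emma:end values."""
    ref = elem.get(EMMA + "time-ref-uri")
    target = doc_root.find(f".//*[@id='{ref.lstrip('#')}']")
    anchor = elem.get(EMMA + "time-ref-anchor-point", "start")  # default: start
    offset = int(elem.get(EMMA + "offset-to-start", "0"))       # default: 0
    return int(target.get(EMMA + anchor)) + offset

doc = ET.fromstring(
    '<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">'
    '<emma:interpretation id="int1" emma:start="1000" emma:end="3000"/>'
    '<emma:interpretation id="int2" emma:time-ref-uri="#int1"'
    ' emma:time-ref-anchor-point="start" emma:offset-to-start="5000"/>'
    '</emma:emma>')
print(absolute_start(doc.find(".//*[@id='int2']"), doc))  # 1000 + 5000 = 6000
```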
emma:duration attribute

Annotation | emma:duration |
---|---|
Definition | Attribute of type xsd:nonNegativeInteger, defaulting to zero. It specifies the duration of the input in milliseconds. |
Applies to | emma:interpretation, emma:group, emma:one-of, emma:sequence, emma:arc, and application instance data |
The duration of an input in milliseconds MAY be specified with the emma:duration attribute. The emma:duration attribute MAY be used either in combination with timestamps or independently, for example in the annotation of speech corpora.

In the following example, the duration of the signal that gave rise to the interpretation is indicated using emma:duration.
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation emma:duration="2300"
      emma:medium="acoustic" emma:mode="voice">
    <origin>Denver</origin>
  </emma:interpretation>
</emma:emma>
This section is informative.
The following table provides guidance on how to determine the values of relative timestamps on a composite input.
Annotation | Guidance |
---|---|
emma:time-ref-uri | If the reference interval URI is the same for both inputs, then it should be the same for the composite input. If it is not the same, then relative timestamps will have to be resolved to absolute timestamps in order to determine the combined timestamp. |
emma:time-ref-anchor-point | If the anchor value is the same for both inputs, then it should be the same for the composite input. If it is not the same, then relative timestamps will have to be resolved to absolute timestamps in order to determine the combined timestamp. |
emma:offset-to-start | Given that the emma:time-ref-uri and emma:time-ref-anchor-point are the same for both combining inputs, the emma:offset-to-start for the combination should be the lesser of the two. If they are not the same, then relative timestamps will have to be resolved to absolute timestamps in order to determine the combined timestamp. |
emma:duration | Given that the emma:time-ref-uri and emma:time-ref-anchor-point are the same for both combining inputs, the emma:duration is calculated as follows: add together the emma:offset-to-start and emma:duration for each of the inputs; take whichever of these is greater and subtract from it the lesser of the emma:offset-to-start values in order to determine the combined duration. If emma:time-ref-uri and emma:time-ref-anchor-point are not the same, then relative timestamps will have to be resolved to absolute timestamps in order to determine the combined timestamp. |
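The rules in the table above can be sketched as a small helper. This is an illustration only (not part of the specification), under the assumption that each input's relative-timestamp annotations are supplied as a plain dictionary:

```python
def combine_relative(a, b):
    """Combine the relative timestamps of two inputs per the table above.

    Each input is a dict with keys 'uri', 'anchor', 'offset', 'duration'
    (offset and duration in milliseconds). Returns None when the reference
    interval or anchor point differ, in which case the relative timestamps
    must first be resolved to absolute ones."""
    if a["uri"] != b["uri"] or a["anchor"] != b["anchor"]:
        return None
    offset = min(a["offset"], b["offset"])  # lesser of the two offsets
    # Greater of (offset + duration), minus the lesser offset.
    end = max(a["offset"] + a["duration"], b["offset"] + b["duration"])
    return {"uri": a["uri"], "anchor": a["anchor"],
            "offset": offset, "duration": end - offset}

# Hypothetical speech and pen inputs anchored to the same interval:
speech = {"uri": "#int1", "anchor": "start", "offset": 5000, "duration": 2300}
pen = {"uri": "#int1", "anchor": "start", "offset": 6000, "duration": 800}
print(combine_relative(speech, pen))
# {'uri': '#int1', 'anchor': 'start', 'offset': 5000, 'duration': 2300}
```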
emma:medium, emma:mode, emma:function, emma:verbal attributes

Annotation | emma:medium |
---|---|
Definition | An attribute of type xsd:nmtokens which contains a space-delimited set of values from the set {acoustic, tactile, visual}. |
Applies to | emma:interpretation, emma:group, emma:one-of, emma:sequence, emma:endpoint, and application instance data |

Annotation | emma:mode |
---|---|
Definition | An attribute of type xsd:nmtokens which contains a space-delimited set of values from an open set of values including: {voice, dtmf, ink, gui, keys, video, photograph, ...}. |
Applies to | emma:interpretation, emma:group, emma:one-of, emma:sequence, emma:endpoint, and application instance data |

Annotation | emma:function |
---|---|
Definition | An attribute of type xsd:string constrained to values in the open set {recording, transcription, dialog, verification, ...}. |
Applies to | emma:interpretation, emma:group, emma:one-of, emma:sequence, and application instance data |

Annotation | emma:verbal |
---|---|
Definition | An attribute of type xsd:boolean. |
Applies to | emma:interpretation, emma:group, emma:one-of, emma:sequence, and application instance data |
EMMA provides two properties for the annotation of input modality: one indicating the broader medium or channel (emma:medium) and another indicating the specific mode of communication used on that channel (emma:mode). The input medium is defined from the user's perspective and indicates whether they use their voice (acoustic), touch (tactile), or visual appearance/motion (visual) as input. Tactile includes most hands-on input device types such as pen, mouse, keyboard, and touch screen. Visual is used for camera input.
emma:medium = space-delimited sequence of values from the set: [acoustic|tactile|visual]
The mode property provides the ability to distinguish between different modes of communication that may be used within a particular medium. For example, in the tactile medium, modes include electronic ink (ink) and pointing and clicking on a graphical user interface (gui).
emma:mode = space-delimited sequence of values from the set: [voice|dtmf|ink|gui|keys|video|photograph| ... ]
The emma:medium classification is based on the boundary between the user and the device that they use. For emma:medium="tactile" the user physically touches the device in order to provide input. For emma:medium="visual" the user's movement is captured by sensors (cameras, infrared), resulting in an input to the system. In the case where emma:medium="acoustic" the user provides input to the system by producing an acoustic signal. Note then that DTMF input will be classified as emma:medium="tactile" since in order to provide DTMF input the user physically presses keys on a keypad.
While emma:medium and emma:mode are optional on specific elements such as emma:interpretation and emma:one-of, note that all EMMA interpretations must be annotated for emma:medium and emma:mode: either these attributes must appear directly on emma:interpretation, or they must appear on an ancestor emma:one-of node, or they must appear on an earlier stage of the derivation listed in emma:derivation.
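The lookup just described can be sketched as follows. This is illustrative only; since Python's standard ElementTree does not track parent links, the caller supplies the ancestor chain and any earlier derivation stages explicitly:

```python
import xml.etree.ElementTree as ET

EMMA = "{http://www.w3.org/2003/04/emma}"

def effective_annotation(name, interp, ancestors=(), earlier_stages=()):
    """Resolve emma:medium or emma:mode for an interpretation: check the
    element itself, then ancestor emma:one-of elements (nearest first),
    then earlier stages of the derivation."""
    for elem in (interp, *ancestors, *earlier_stages):
        value = elem.get(EMMA + name)
        if value is not None:
            return value
    return None

doc = ET.fromstring(
    '<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">'
    '<emma:one-of emma:medium="acoustic" emma:mode="voice">'
    '<emma:interpretation id="i1"/></emma:one-of></emma:emma>')
one_of = doc.find(EMMA + "one-of")
interp = one_of.find(EMMA + "interpretation")
print(effective_annotation("mode", interp, ancestors=[one_of]))  # voice
```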
Orthogonal to the mode, user inputs can also be classified with respect to their communicative function. This enables a simpler mode classification.
emma:function = [recording|transcription|dialog|verification| ... ]
For example, speech can be used for recording (e.g. voicemail), transcription (e.g. dictation), dialog (e.g. interactive spoken dialog systems), and verification (e.g. identifying users through their voiceprints).
EMMA also supports an additional property, emma:verbal, which distinguishes verbal use of an input mode from non-verbal. This MAY be used to distinguish the use of electronic ink to convey handwritten commands from the use of electronic ink for symbolic gestures such as circles and arrows. Handwritten commands, such as writing downtown in order to change a map display to show the downtown, are classified as verbal (emma:function="dialog" emma:verbal="true"). Pen gestures (arrows, lines, circles, etc.), such as circling a building, are classified as non-verbal dialog (emma:function="dialog" emma:verbal="false"). The use of handwritten words to transcribe an email message is classified as transcription (emma:function="transcription" emma:verbal="true").
emma:verbal = [true|false]
Handwritten words and ink gestures are typically recognized using different kinds of recognition components (handwriting recognizer vs. gesture recognizer), and the verbal annotation will be added by the recognition component which classifies the input. The original input source, a pen in this case, will not be aware of this difference. The input source identifier will tell you that the input was from a pen of some kind but will not tell you if the mode of input was handwriting (show downtown) or gesture (e.g. circling an object or area).
Here is an example of the EMMA annotation for a pen input where the user's ink is recognized as either a word ("Boston") or as an arrow:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:one-of>
    <emma:interpretation emma:confidence="0.6"
        emma:medium="tactile" emma:mode="ink"
        emma:function="dialog" emma:verbal="true">
      <location>Boston</location>
    </emma:interpretation>
    <emma:interpretation emma:confidence="0.4"
        emma:medium="tactile" emma:mode="ink"
        emma:function="dialog" emma:verbal="false">
      <direction>45</direction>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
Here is an example of the EMMA annotation for a spoken command which is recognized as either "Boston" or "Austin":
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:one-of>
    <emma:interpretation emma:confidence="0.6"
        emma:medium="acoustic" emma:mode="voice"
        emma:function="dialog" emma:verbal="true">
      <location>Boston</location>
    </emma:interpretation>
    <emma:interpretation emma:confidence="0.4"
        emma:medium="acoustic" emma:mode="voice"
        emma:function="dialog" emma:verbal="true">
      <location>Austin</location>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
The following table shows the relationship between the medium, mode, and function properties and serves as an aid for classifying inputs. For the dialog function it also shows some examples of the classification of inputs as verbal vs. non-verbal.
Medium | Device | Mode | Function: recording | Function: dialog | Function: transcription | Function: verification |
---|---|---|---|---|---|---|
acoustic | microphone | voice | audiofile (e.g. voicemail) | spoken command / query / response (verbal = true); singing a note (verbal = false) | dictation | speaker recognition |
tactile | keypad | dtmf | audiofile / character stream | typed command / query / response (verbal = true); command key "Press 9 for sales" (verbal = false) | text entry (T9-tegic, word completion, or word grammar) | password / pin entry |
tactile | keyboard | keys | character / key-code stream | typed command / query / response (verbal = true); command key "Press S for sales" (verbal = false) | typing | password / pin entry |
tactile | pen | ink | trace, sketch | handwritten command / query / response (verbal = true); gesture (e.g. circling building) (verbal = false) | handwritten text entry | signature, handwriting recognition |
tactile | pen | gui | N/A | tapping on named button (verbal = true); drag and drop, tapping on map (verbal = false) | soft keyboard | password / pin entry |
tactile | mouse | ink | trace, sketch | handwritten command / query / response (verbal = true); gesture (e.g. circling building) (verbal = false) | handwritten text entry | N/A |
tactile | mouse | gui | N/A | clicking named button (verbal = true); drag and drop, clicking on map (verbal = false) | soft keyboard | password / pin entry |
tactile | joystick | ink | trace, sketch | gesture (e.g. circling building) (verbal = false) | N/A | N/A |
tactile | joystick | gui | N/A | pointing, clicking button / menu (verbal = false) | soft keyboard | password / pin entry |
visual | page scanner | photograph | image | handwritten command / query / response (verbal = true); drawings and images (verbal = false) | optical character recognition, object/scene recognition (markup, e.g. SVG) | N/A |
visual | still camera | photograph | image | objects (verbal = false) | visual object/scene recognition | face id, retinal scan |
visual | video camera | video | movie | sign language (verbal = true); face / hand / arm / body gesture (e.g. pointing, facing) (verbal = false) | audio/visual recognition | face id, gait id, retinal scan |
emma:hook attribute

Annotation | emma:hook |
---|---|
Definition | An attribute of type xsd:string constrained to values in the open set {voice, dtmf, ink, gui, keys, video, photograph, ...} or the wildcard any. |
Applies to | Application instance data |
The attribute emma:hook MAY be used to mark the elements in the application semantics within an emma:interpretation which are expected to be integrated with content from input in another mode to yield a complete interpretation. The emma:mode to be integrated at that point in the application semantics is indicated as the value of the emma:hook attribute. The possible values of emma:hook are the list of input modes that can be values of emma:mode (see Section 4.2.11). In addition to these, the value of emma:hook can also be the wildcard any, indicating that the other content can come from any source. The annotation emma:hook differs in semantics from emma:mode as follows: annotating an element in the application semantics with emma:mode="ink" indicates that that part of the semantics came from the ink mode, whereas annotating an element in the application semantics with emma:hook="ink" indicates that that part of the semantics needs to be integrated with content from the ink mode.
To illustrate the use of emma:hook, consider an example composite input in which the user says "zoom in here" in the speech input mode while drawing an area on a graphical display in the ink input mode. The fact that the location element needs to come from the ink mode is indicated by annotating this application namespace element using emma:hook:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation
      emma:medium="acoustic" emma:mode="voice">
    <command>
      <action>zoom</action>
      <location emma:hook="ink">
        <type>area</type>
      </location>
    </command>
  </emma:interpretation>
</emma:emma>
For a more detailed explanation of this example, see Appendix C.
emma:cost attribute

Annotation | emma:cost |
---|---|
Definition | An attribute of type xsd:decimal in the range 0.0 to 10000000, indicating the processor's cost or weight associated with an input or part of an input. |
Applies to | emma:interpretation, emma:group, emma:one-of, emma:sequence, emma:arc, emma:node, and application instance data |
The cost annotation in EMMA indicates the weight or cost associated with a user's input or part of their input. The most common use of emma:cost is for representing the costs encoded on a lattice output from speech recognition or other recognition or understanding processes. emma:cost MAY also be used to indicate the total cost associated with particular recognition results or semantic interpretations.
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:one-of
      emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation emma:cost="1600">
      <location>Boston</location>
    </emma:interpretation>
    <emma:interpretation emma:cost="400">
      <location>Austin</location>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
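A consumer of a document like the one above might select among competing interpretations by cost. The sketch below is illustrative only and assumes the usual lattice convention that a lower cost is better (the specification itself does not mandate an ordering):

```python
import xml.etree.ElementTree as ET

EMMA = "{http://www.w3.org/2003/04/emma}"
EX = "{http://www.example.com/example}"

doc = ET.fromstring(
    '<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"'
    ' xmlns="http://www.example.com/example">'
    '<emma:one-of emma:medium="acoustic" emma:mode="voice">'
    '<emma:interpretation emma:cost="1600"><location>Boston</location>'
    '</emma:interpretation>'
    '<emma:interpretation emma:cost="400"><location>Austin</location>'
    '</emma:interpretation></emma:one-of></emma:emma>')

# Pick the child interpretation with the lowest emma:cost (lower = better).
best = min(doc.find(EMMA + "one-of"), key=lambda i: float(i.get(EMMA + "cost")))
print(best.find(EX + "location").text)  # Austin
```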
emma:endpoint-role, emma:endpoint-address, emma:port-type, emma:port-num, emma:message-id, emma:service-name, emma:endpoint-pair-ref, emma:endpoint-info-ref attributes

Annotation | emma:endpoint-role |
---|---|
Definition | An attribute of type xsd:string constrained to values in the set {source, sink, reply-to, router}. |
Applies to | emma:endpoint |

Annotation | emma:endpoint-address |
---|---|
Definition | An attribute of type xsd:anyURI that uniquely specifies the network address of the emma:endpoint. |
Applies to | emma:endpoint |

Annotation | emma:port-type |
---|---|
Definition | An attribute of type xsd:QName that specifies the type of the port. |
Applies to | emma:endpoint |

Annotation | emma:port-num |
---|---|
Definition | An attribute of type xsd:nonNegativeInteger that specifies the port number. |
Applies to | emma:endpoint |

Annotation | emma:message-id |
---|---|
Definition | An attribute of type xsd:anyURI that specifies the message ID associated with the data. |
Applies to | emma:endpoint |

Annotation | emma:service-name |
---|---|
Definition | An attribute of type xsd:string that specifies the name of the service. |
Applies to | emma:endpoint |

Annotation | emma:endpoint-pair-ref |
---|---|
Definition | An attribute of type xsd:anyURI that specifies the pairing between sink and source endpoints. |
Applies to | emma:endpoint |

Annotation | emma:endpoint-info-ref |
---|---|
Definition | An attribute of type xsd:IDREF referring to the id attribute of an emma:endpoint-info element. |
Applies to | emma:interpretation, emma:group, emma:one-of, emma:sequence, and application instance data |
The emma:endpoint-role attribute specifies the role that the particular emma:endpoint performs in multimodal interaction. The role value sink indicates that the particular endpoint is the receiver of the input data. The role value source indicates that the particular endpoint is the sender of the input data. The role value reply-to indicates that the particular emma:endpoint is the intended endpoint for the reply. The same emma:endpoint-address MAY appear in multiple emma:endpoint elements, provided that the same endpoint address is used to serve multiple roles, e.g. sink, source, reply-to, router, etc., or is associated with multiple interpretations.
The emma:endpoint-address specifies the network address of the emma:endpoint, and emma:port-type specifies the port type of the emma:endpoint. The emma:port-num annotates the port number of the endpoint (e.g. the typical port number for an HTTP endpoint is 80). The emma:message-id annotates the message ID information associated with the annotated input. This meta information is used to establish and maintain the communication context for both inbound processing and outbound operation. The service specification of the emma:endpoint is annotated by emma:service-name, which contains the definition of the service that the emma:endpoint performs. The matching of the sink endpoint and its pairing source endpoint is annotated by the emma:endpoint-pair-ref attribute. One sink endpoint MAY link to multiple source endpoints through emma:endpoint-pair-ref. Further bounding of the emma:endpoint is possible by using the annotation of emma:group (see Section 3.3.2).
The emma:endpoint-info-ref attribute associates the EMMA result in the container element with an emma:endpoint-info element.
The following example illustrates the use of these attributes in multimodal interactions where multiple modalities are used.
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example"
    xmlns:ex="http://www.example.com/emma/port">
  <emma:endpoint-info id="audio-channel-1">
    <emma:endpoint id="endpoint1"
        emma:endpoint-role="sink"
        emma:endpoint-address="135.61.71.103"
        emma:port-num="50204"
        emma:port-type="rtp"
        emma:endpoint-pair-ref="endpoint2"
        emma:media-type="audio/dsr-202212; rate:8000; maxptime:40"
        emma:service-name="travel"
        emma:mode="voice">
      <ex:app-protocol>SIP</ex:app-protocol>
    </emma:endpoint>
    <emma:endpoint id="endpoint2"
        emma:endpoint-role="source"
        emma:endpoint-address="136.62.72.104"
        emma:port-num="50204"
        emma:port-type="rtp"
        emma:endpoint-pair-ref="endpoint1"
        emma:media-type="audio/dsr-202212; rate:8000; maxptime:40"
        emma:service-name="travel"
        emma:mode="voice">
      <ex:app-protocol>SIP</ex:app-protocol>
    </emma:endpoint>
  </emma:endpoint-info>
  <emma:endpoint-info id="ink-channel-1">
    <emma:endpoint id="endpoint3"
        emma:endpoint-role="sink"
        emma:endpoint-address="http://emma.example/sink"
        emma:endpoint-pair-ref="endpoint4"
        emma:port-num="80"
        emma:port-type="http"
        emma:message-id="uuid:2e5678"
        emma:service-name="travel"
        emma:mode="ink"/>
    <emma:endpoint id="endpoint4"
        emma:endpoint-role="source"
        emma:endpoint-address="http://emma.example/source"
        emma:endpoint-pair-ref="endpoint3"
        emma:port-num="80"
        emma:port-type="http"
        emma:message-id="uuid:2e5678"
        emma:service-name="travel"
        emma:mode="ink"/>
  </emma:endpoint-info>
  <emma:group>
    <emma:interpretation
        emma:start="1087995961542" emma:end="1087995963542"
        emma:endpoint-info-ref="audio-channel-1"
        emma:medium="acoustic" emma:mode="voice">
      <destination>Chicago</destination>
    </emma:interpretation>
    <emma:interpretation
        emma:start="1087995961542" emma:end="1087995963542"
        emma:endpoint-info-ref="ink-channel-1"
        emma:medium="tactile" emma:mode="ink">
      <location>
        <type>area</type>
        <points>34.13 -37.12 42.13 -37.12 ... </points>
      </location>
    </emma:interpretation>
  </emma:group>
</emma:emma>
emma:grammar element: emma:grammar-ref attribute

Annotation | emma:grammar-ref |
---|---|
Definition | An attribute of type xsd:IDREF referring to the id attribute of an emma:grammar element. |
Applies to | emma:interpretation, emma:group, emma:one-of, emma:sequence |
The emma:grammar-ref annotation associates the EMMA result in the container element with an emma:grammar element.
Example:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:grammar id="gram1" ref="someURI"/>
  <emma:grammar id="gram2" ref="anotherURI"/>
  <emma:one-of
      emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation emma:grammar-ref="gram1">
      <origin>Boston</origin>
    </emma:interpretation>
    <emma:interpretation emma:grammar-ref="gram1">
      <origin>Austin</origin>
    </emma:interpretation>
    <emma:interpretation emma:grammar-ref="gram2">
      <command>help</command>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
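Resolving the IDREF is a simple lookup. The sketch below (illustrative only, not part of the specification) follows emma:grammar-ref to the emma:grammar element with the matching id and returns its ref URI, assuming the emma:grammar elements carry id attributes:

```python
import xml.etree.ElementTree as ET

EMMA = "{http://www.w3.org/2003/04/emma}"

def grammar_uri(doc_root, interp):
    """Follow emma:grammar-ref (an IDREF) to the emma:grammar element
    with the matching id and return that element's ref URI."""
    gid = interp.get(EMMA + "grammar-ref")
    grammar = doc_root.find(f".//{EMMA}grammar[@id='{gid}']")
    return None if grammar is None else grammar.get("ref")

doc = ET.fromstring(
    '<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">'
    '<emma:grammar id="gram1" ref="someURI"/>'
    '<emma:interpretation emma:grammar-ref="gram1"/></emma:emma>')
interp = doc.find(EMMA + "interpretation")
print(grammar_uri(doc, interp))  # someURI
```

The same lookup pattern applies to emma:model-ref and emma:endpoint-info-ref, which are also IDREFs into the same document.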
emma:model element: emma:model-ref attribute

Annotation | emma:model-ref |
---|---|
Definition | An attribute of type xsd:IDREF referring to the id attribute of an emma:model element. |
Applies to | emma:interpretation, emma:group, emma:one-of, emma:sequence, and application instance data |
The emma:model-ref annotation associates the EMMA result in the container element with an emma:model element.
Example:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:model id="model1" ref="someURI"/>
  <emma:model id="model2" ref="anotherURI"/>
  <emma:one-of
      emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation emma:model-ref="model1">
      <origin>Boston</origin>
    </emma:interpretation>
    <emma:interpretation emma:model-ref="model1">
      <origin>Austin</origin>
    </emma:interpretation>
    <emma:interpretation emma:model-ref="model2">
      <command>help</command>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
emma:dialog-turn attribute

Annotation | emma:dialog-turn |
---|---|
Definition | An attribute of type xsd:string referring to the dialog turn associated with a given container element. |
Applies to | emma:interpretation, emma:group, emma:one-of, and emma:sequence |
The emma:dialog-turn annotation associates the EMMA result in the container element with a dialog turn. The syntax and semantics of dialog turns are left open to suit the needs of individual applications. For example, some applications might use an integer value, where successive turns are represented by successive integers. Other applications might combine the name of a dialog participant with an integer value representing the turn number for that participant. Ordering semantics for comparison of emma:dialog-turn is deliberately unspecified and left for applications to define.
Example:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation emma:dialog-turn="u8"
      emma:medium="acoustic" emma:mode="voice">
    <quantity>3</quantity>
  </emma:interpretation>
</emma:emma>
The emma:derived-from element (Section 4.1.2) can be used to capture both sequential and composite derivations. This section concerns the scope of EMMA annotations across sequential derivations of user input connected using the emma:derived-from element (Section 4.1.2). Sequential derivations involve processing steps that do not involve multimodal integration, such as applying natural language understanding and then reference resolution to a speech transcription. EMMA derivations describe only single turns of user input and are not intended to describe a sequence of dialog turns.
For example, an EMMA document could contain emma:interpretation elements for the transcription, interpretation, and reference resolution of a speech input, utilizing the id values raw, better, and best respectively:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:derivation>
    <emma:interpretation id="raw"
        emma:process="http://example.com/myasr1.xml"
        emma:medium="acoustic" emma:mode="voice">
      <answer>From Boston to Denver tomorrow</answer>
    </emma:interpretation>
    <emma:interpretation id="better"
        emma:process="http://example.com/mynlu1.xml">
      <emma:derived-from resource="#raw" composite="false"/>
      <origin>Boston</origin>
      <destination>Denver</destination>
      <date>tomorrow</date>
    </emma:interpretation>
  </emma:derivation>
  <emma:interpretation id="best"
      emma:process="http://example.com/myrefresolution1.xml">
    <emma:derived-from resource="#better" composite="false"/>
    <origin>Boston</origin>
    <destination>Denver</destination>
    <date>03152003</date>
  </emma:interpretation>
</emma:emma>
Each member of the derivation chain is linked to the previous one by an emma:derived-from element (Section 4.1.2), which has an attribute resource that provides a pointer to the emma:interpretation from which it is derived. The emma:process annotation (Section 4.2.2) provides a pointer to the process used for each stage of the derivation.
The following EMMA example represents the same derivation as above but with a more fully specified set of annotations:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:derivation>
    <emma:interpretation id="raw"
        emma:process="http://example.com/myasr1.xml"
        emma:source="http://example.com/microphone/NC-61"
        emma:signal="http://example.com/signals/sg23.wav"
        emma:confidence="0.6"
        emma:medium="acoustic" emma:mode="voice"
        emma:function="dialog" emma:verbal="true"
        emma:tokens="from boston to denver tomorrow"
        emma:lang="en-US">
      <answer>From Boston to Denver tomorrow</answer>
    </emma:interpretation>
    <emma:interpretation id="better"
        emma:process="http://example.com/mynlu1.xml"
        emma:source="http://example.com/microphone/NC-61"
        emma:signal="http://example.com/signals/sg23.wav"
        emma:confidence="0.8"
        emma:medium="acoustic" emma:mode="voice"
        emma:function="dialog" emma:verbal="true"
        emma:tokens="from boston to denver tomorrow"
        emma:lang="en-US">
      <emma:derived-from resource="#raw" composite="false"/>
      <origin>Boston</origin>
      <destination>Denver</destination>
      <date>tomorrow</date>
    </emma:interpretation>
  </emma:derivation>
  <emma:interpretation id="best"
      emma:process="http://example.com/myrefresolution1.xml"
      emma:source="http://example.com/microphone/NC-61"
      emma:signal="http://example.com/signals/sg23.wav"
      emma:confidence="0.8"
      emma:medium="acoustic" emma:mode="voice"
      emma:function="dialog" emma:verbal="true"
      emma:tokens="from boston to denver tomorrow"
      emma:lang="en-US">
    <emma:derived-from resource="#better" composite="false"/>
    <origin>Boston</origin>
    <destination>Denver</destination>
    <date>03152003</date>
  </emma:interpretation>
</emma:emma>
EMMA annotations on earlier stages of the derivation often remain accurate at later stages of the derivation. Although this can be captured in EMMA by repeating the annotations on each emma:interpretation within the derivation, as in the example above, this approach to annotation has two disadvantages. First, the repetition of annotations makes the resulting EMMA documents significantly more verbose. Second, EMMA processors used for intermediate tasks such as natural language understanding and reference resolution will need to read in all of the annotations and write them all out again.
EMMA overcomes these problems by assuming that annotations on earlier stages of a derivation automatically apply to later stages of the derivation unless a new value is specified. Later stages of the derivation essentially inherit annotations from earlier stages in the derivation. For example, if there were an emma:source annotation on the transcription (raw), it would also apply to the later stages of the derivation such as the result of natural language understanding (better) or reference resolution (best).
Because of the assumption in EMMA that annotations have scopeover later stages of a sequential derivation, the example EMMAdocument above can be equivalently represented as follows:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
      http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:derivation>
    <emma:interpretation id="raw"
        emma:process="http://example.com/myasr1.xml"
        emma:source="http://example.com/microphone/NC-61"
        emma:signal="http://example.com/signals/sg23.wav"
        emma:confidence="0.6"
        emma:medium="acoustic" emma:mode="voice"
        emma:function="dialog" emma:verbal="true"
        emma:tokens="from boston to denver tomorrow"
        emma:lang="en-US">
      <answer>From Boston to Denver tomorrow</answer>
    </emma:interpretation>
    <emma:interpretation id="better"
        emma:process="http://example.com/mynlu1.xml"
        emma:confidence="0.8">
      <emma:derived-from resource="#raw" composite="false"/>
      <origin>Boston</origin>
      <destination>Denver</destination>
      <date>tomorrow</date>
    </emma:interpretation>
  </emma:derivation>
  <emma:interpretation id="best"
      emma:process="http://example.com/myrefresolution1.xml">
    <emma:derived-from resource="#better" composite="false"/>
    <origin>Boston</origin>
    <destination>Denver</destination>
    <date>03152003</date>
  </emma:interpretation>
</emma:emma>
The fully specified derivation illustrated above is equivalent to the reduced-form derivation following it, where only annotations with new values are specified at each stage. These two EMMA documents MUST yield the same result when processed by an EMMA processor.
The emma:confidence annotation is respecified on the better interpretation. This indicates the confidence score for natural language understanding, whereas emma:confidence on the raw interpretation indicates the speech recognition confidence score.
In order to determine the full set of annotations that apply to an emma:interpretation element, an EMMA processor or script needs to access the annotations directly on that element and, for any that are not specified, follow the reference in the resource attribute of the emma:derived-from element to add in annotations from earlier stages of the derivation.
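That procedure can be sketched as a recursive merge. The sketch is illustrative only and assumes same-document fragment references and id attributes on the derivation stages:

```python
import xml.etree.ElementTree as ET

EMMA = "{http://www.w3.org/2003/04/emma}"

def resolved_annotations(doc_root, interp):
    """Collect the full set of annotations on an interpretation, inheriting
    from earlier derivation stages: any annotation not present locally is
    taken from the interpretation referenced by the resource attribute of
    emma:derived-from, recursively."""
    annotations = dict(interp.attrib)
    derived = interp.find(EMMA + "derived-from")
    if derived is not None:
        ref = derived.get("resource", "")
        if ref.startswith("#"):
            source = doc_root.find(f".//*[@id='{ref[1:]}']")
            if source is not None:
                for key, value in resolved_annotations(doc_root, source).items():
                    annotations.setdefault(key, value)  # local values win
    return annotations

doc = ET.fromstring(
    '<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">'
    '<emma:derivation>'
    '<emma:interpretation id="raw" emma:mode="voice" emma:confidence="0.6"/>'
    '</emma:derivation>'
    '<emma:interpretation id="better" emma:confidence="0.8">'
    '<emma:derived-from resource="#raw" composite="false"/>'
    '</emma:interpretation></emma:emma>')
resolved = resolved_annotations(doc, doc.find(".//*[@id='better']"))
print(resolved[EMMA + "mode"])        # voice (inherited from "raw")
print(resolved[EMMA + "confidence"])  # 0.8 (respecified locally)
```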
The EMMA annotations break down into three groups with respect to their scope in sequential derivations. One group of annotations always holds true for all members of a sequential derivation. A second group is always respecified on each stage of the derivation. A third group may or may not be respecified.
| Classification | Annotations |
|---|---|
| Applies to whole derivation | emma:signal, emma:signal-size, emma:dialog-turn, emma:source, emma:medium, emma:mode, emma:function, emma:verbal, emma:lang, emma:tokens, emma:start, emma:end, emma:time-ref-uri, emma:time-ref-anchor-point, emma:offset-to-start, emma:duration |
| Specified at each stage of derivation | emma:derived-from, emma:process |
| May be respecified | emma:confidence, emma:cost, emma:grammar-ref, emma:model-ref, emma:no-input, emma:uninterpreted |
One potential problem with this annotation scoping mechanism is that earlier annotations could be lost if earlier stages of a derivation were dropped in order to reduce message size. This problem can be overcome by considering annotation scope at the point where earlier derivation stages are discarded and populating the final interpretation in the derivation with all of the annotations which it could inherit. For example, if the raw and better stages were dropped, the resulting EMMA document would be:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation
      emma:start="1087995961542" emma:end="1087995963542"
      emma:process="http://example.com/myrefresolution1.xml"
      emma:source="http://example.com/microphone/NC-61"
      emma:signal="http://example.com/signals/sg23.wav"
      emma:confidence="0.8"
      emma:medium="acoustic" emma:mode="voice"
      emma:function="dialog" emma:verbal="true"
      emma:tokens="from boston to denver tomorrow"
      emma:lang="en-US">
    <emma:derived-from resource="#better" composite="false"/>
    <origin>Boston</origin>
    <destination>Denver</destination>
    <date>03152003</date>
  </emma:interpretation>
</emma:emma>
Annotations on an emma:one-of element are assumed to apply to all of the container elements within the emma:one-of. If an emma:one-of appears within another emma:one-of, then annotations on the parent emma:one-of are assumed to apply to the children of the child emma:one-of.
Annotations on emma:group or emma:sequence do not apply to their child elements.
The contents of this section are normative.
A document is a Conforming EMMA Document if it meets both of the following conditions:
The EMMA specification and these conformance criteria provide no designated size limits on any aspect of EMMA documents. There are no maximum values on the number of elements, the amount of character data, or the number of characters in attribute values.
Within this specification, the term URI refers to a Uniform Resource Identifier as defined in [RFC3986] and extended in [RFC3987] with the new name IRI. The term URI has been retained in preference to IRI to avoid introducing new names for concepts such as "Base URI" that are defined or referenced across the whole family of XML specifications.
The EMMA namespace is intended to be used with other XML namespaces as per the Namespaces in XML Recommendation [XMLNS]. Future work by W3C is expected to address ways to specify conformance for documents involving multiple namespaces.
An EMMA processor is a program that can process and/or generate Conforming EMMA documents.
In a Conforming EMMA Processor, the XML parser MUST be able to parse and process all XML constructs defined by XML 1.1 [XML] and Namespaces in XML [XMLNS]. It is not required that a Conforming EMMA Processor use a validating XML parser.
A Conforming EMMA Processor MUST correctly understand and apply the semantics of each markup element or attribute as described by this document.
There is, however, no conformance requirement with respect to performance characteristics of the EMMA Processor. For instance, no statement is required regarding the accuracy, speed or other characteristics of output produced by the processor. No statement is made regarding the size of input that an EMMA Processor is required to support.
This section is Normative.
This section defines the formal syntax for EMMA documents in terms of a normative XML Schema.
There are both an XML Schema and a RELAX NG Schema for the EMMA markup. The latest version of the XML Schema for EMMA is available at http://www.w3.org/TR/emma/emma.xsd and the RELAX NG Schema can be found at http://www.w3.org/TR/emma/emma.rng.
For stability it is RECOMMENDED that you use the dated URIs available at http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd and http://www.w3.org/TR/2009/REC-emma-20090210/emma.rng.
This section is Normative.
This appendix registers a new MIME media type, "application/emma+xml".
The "application/emma+xml" media type is registered with IANA at http://www.iana.org/assignments/media-types/application/.
MIME media type name: application

MIME subtype name: emma+xml

Required parameters: None.

Optional parameters: charset

This parameter has identical semantics to the charset parameter of the application/xml media type as specified in [RFC3023] or its successor.
By virtue of EMMA content being XML, it has the same considerations when sent as "application/emma+xml" as does XML. See RFC 3023 (or its successor), section 3.2.
Several features of EMMA require dereferencing arbitrary URIs. Implementers are advised to heed the security issues of [RFC3986] section 7.
In addition, because of the extensibility features for EMMA, it is possible that "application/emma+xml" will describe content that has security implications beyond those described here. However, if the processor follows only the normative semantics of this specification, this content will be ignored. Only in the case where the processor recognizes and processes the additional content, or where further processing of that content is dispatched to other processors, would security issues potentially arise. And in that case, they would fall outside the domain of this registration document.
This specification describes processing semantics that dictate the required behavior for dealing with, among other things, unrecognized elements.
Because EMMA is extensible, conformant "application/emma+xml" processors MAY expect that content received is well-formed XML, but processors SHOULD NOT assume that the content is valid EMMA or expect to recognize all of the elements and attributes in the document.
This media type registration is extracted from Appendix B of the "EMMA: Extensible MultiModal Annotation markup language" specification.
Magic number(s): There is no single initial octet sequence that is always present in EMMA documents.
File extension(s): EMMA documents are most often identified with the extension ".emma".
Macintosh file type code(s): TEXT
Person & email address to contact for further information: Kazuyuki Ashimura, <ashimura@w3.org>.
Intended usage: COMMON
Author/Change controller: The EMMA specification is a work product of the World Wide Web Consortium's Multimodal Interaction Working Group. The W3C has change control over these specifications.
emma:hook and SRGS

This section is Informative.
One of the most powerful aspects of multimodal interfaces is their ability to provide support for user inputs which are distributed over the available input modes. These composite inputs are contributions made by the user within a single turn which have component parts in different modes. For example, the user might say "zoom in here" in the speech mode while drawing an area on a graphical display in the ink mode. One of the central motivating factors for this kind of input is that different kinds of communicative content are best suited to different input modes. In the example of a user drawing an area on a map and saying "zoom in here", the zoom command is easiest to provide in speech but the spatial information, the specific area, is easier to provide in ink.
Enabling composite multimodality is critical in ensuring that multimodal systems support more natural and effective interaction for users. In order to support composite inputs, a multimodal architecture must provide some kind of multimodal integration mechanism. In the W3C Multimodal Interaction Framework [MMI Framework], multimodal integration can be handled by an integration component which follows the application of speech understanding and other kinds of interpretation procedures for individual modes.
Given the broad range of different techniques being employed for multimodal integration and the extent to which this is an ongoing research problem, standardization of the specific method or algorithm used for multimodal integration is not appropriate at this time. In order to facilitate the development and interoperation of different multimodal integration mechanisms, EMMA provides a markup language enabling application-independent specification of elements in the application markup where content from another mode needs to be integrated. These representation 'hooks' can then be used by different kinds of multimodal integration components and algorithms to drive the process of multimodal integration. In the processing of a composite multimodal input, the result of applying a mode-specific interpretation component to each of the individual modes will be EMMA markup describing the possible interpretation of that input.
One way to build an EMMA representation of a spoken input such as "zoom in here" is to use grammar rules in the W3C Speech Recognition Grammar Specification [SRGS] using the Semantic Interpretation [SISR] tags to build the application semantics with the emma:hook attribute. In this approach, [ECMAScript] is specified in order to build up an object representing the semantics. The resulting ECMAScript object is then translated to XML.
For our example case of "zoom in here", the following SRGS rule could be used. The Semantic Interpretation for Speech Recognition specification [SISR] provides a reserved property _nsprefix for indicating the namespace to be used with an attribute.
<rule>
  zoom in here
  <tag>
    $.command = new Object();
    $.command.action = "zoom";
    $.command.location = new Object();
    $.command.location._attributes = new Object();
    $.command.location._attributes.hook = new Object();
    $.command.location._attributes.hook._nsprefix = "emma";
    $.command.location._attributes.hook._value = "ink";
    $.command.location.type = "area";
  </tag>
</rule>
Application of this rule will result in the following ECMAScript object being built.
command: {
  action: "zoom"
  location: {
    _attributes: {
      hook: {
        _nsprefix: "emma"
        _value: "ink"
      }
    }
    type: "area"
  }
}
SI processing in an XML environment would generate the following document:
<command>
  <action>zoom</action>
  <location emma:hook="ink">
    <type>area</type>
  </location>
</command>
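The translation from the ECMAScript result object to XML is left to the implementation. The following non-normative Python sketch shows one possible approach, treating the SISR result as nested dictionaries and honoring the _attributes/_nsprefix/_value conventions described above (writing the prefixed attribute name literally is a simplification; a full implementation would also declare the emma namespace binding):

```python
import xml.etree.ElementTree as ET

def to_xml(name, value):
    """Build an XML element from a SISR-style nested dict, mapping the
    _attributes / _nsprefix / _value conventions onto XML attributes."""
    elem = ET.Element(name)
    for key, val in value.items():
        if key == "_attributes":
            for attr, spec in val.items():
                prefix = spec.get("_nsprefix")
                qname = "%s:%s" % (prefix, attr) if prefix else attr
                elem.set(qname, spec["_value"])
        elif isinstance(val, dict):
            elem.append(to_xml(key, val))
        else:
            ET.SubElement(elem, key).text = str(val)
    return elem

# The ECMAScript result object from the rule above, as nested dicts:
result = {"action": "zoom",
          "location": {"_attributes": {"hook": {"_nsprefix": "emma",
                                                "_value": "ink"}},
                       "type": "area"}}
xml = to_xml("command", result)
# ET.tostring(xml) produces markup equivalent to the <command> document above
```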
This XML fragment might then appear within an EMMA document as follows:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation emma:medium="acoustic" emma:mode="voice">
    <command>
      <action>zoom</action>
      <location emma:hook="ink">
        <type>area</type>
      </location>
    </command>
  </emma:interpretation>
</emma:emma>
The emma:hook annotation indicates that this speech input needs to be combined with ink input such as the following:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation emma:medium="tactile" emma:mode="ink">
    <location>
      <type>area</type>
      <points>42.1345 -37.128 42.1346 -37.120 ... </points>
    </location>
  </emma:interpretation>
</emma:emma>
This representation could be generated by a pen modality component performing gesture recognition and interpretation. The input to the component would be an Ink Markup Language specification [INKML] of the ink trace and the output would be the EMMA document above.
The combination will result in the following EMMA document for the combined speech and pen multimodal input.
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation
      emma:medium="acoustic tactile" emma:mode="voice ink"
      emma:process="http://example.com/myintegrator.xml">
    <emma:derived-from resource="http://example.com/voice1.emma/#voice1" composite="true"/>
    <emma:derived-from resource="http://example.com/pen1.emma/#pen1" composite="true"/>
    <command>
      <action>zoom</action>
      <location>
        <type>area</type>
        <points>42.1345 -37.128 42.1346 -37.120 ... </points>
      </location>
    </command>
  </emma:interpretation>
</emma:emma>
There are two components to the process of integrating these two pieces of semantic markup. The first is to ensure that the two are compatible; that is, that no semantic constraints are violated. The second is to fuse the content from the two sources. In our example, the <type>area</type> element is intended to indicate that this speech command requires integration with an area gesture rather than, for example, a line gesture, which would have the subelement <type>line</type>. This constraint needs to be enforced by whatever mechanism is responsible for multimodal integration.
Many different techniques could be used for achieving this integration of the semantic interpretation of the pen input, a <location> element, with the corresponding <location> element in the speech. The emma:hook annotation simply serves to indicate the existence of this relationship.
One way to achieve both the compatibility checking and fusion of content from the two modes is to use a well-defined general purpose matching mechanism such as unification. Graph unification [Graph unification] is a mathematical operation defined over directed acyclic graphs which captures both of the components of integration in a single operation: the application of the semantic constraints and the fusing of content. One possible semantics for the emma:hook markup indicates that content from the required mode needs to be unified with that position in the application semantics. In order to unify, two elements must not have any conflicting values for subelements or attributes. This procedure can be defined recursively so that elements within the subelements must also not clash, and so on. The result of unification is the union of all of the elements and attributes of the two elements that are being unified.
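As an informal illustration, the recursive unification just described can be sketched over simple dictionary-based feature structures. This is a non-normative sketch; a real integrator would operate over the XML semantics directly.

```python
def unify(a, b):
    """Recursively unify two feature structures represented as dicts.
    Atomic values must match exactly; dicts unify key by key, and the
    result is the union of the features of both."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None          # conflicting substructure
                result[key] = merged
            else:
                result[key] = b_val      # fuse in new content
        return result
    return a if a == b else None         # atomic clash yields failure

speech = {"location": {"type": "area"}}
ink = {"location": {"type": "area", "points": "42.1345 -37.128 ..."}}

assert unify(speech, ink) == {"location": {"type": "area",
                                           "points": "42.1345 -37.128 ..."}}
# A clash, e.g. speech requiring an area while ink supplies a line, fails:
assert unify({"type": "area"}, {"type": "line"}) is None
```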
In addition to the unification operation, in the resulting emma:interpretation the emma:hook attribute needs to be removed and the emma:mode attribute changed to the list of the modes of the individual inputs, e.g. "voice ink".
Instead of the unification operation, for a specific application semantics, integration could be achieved using some other algorithm or script. The benefit of using the unification semantics for emma:hook is that it provides a general purpose mechanism for checking the compatibility of elements and fusing them, whatever the specific elements are in the application-specific semantic representation.
The benefit of using the emma:hook annotation for authors is that it provides an application-independent method for indicating where integration with content from another mode is required. If a general purpose integration mechanism is used, such as the unification approach described above, authors should be able to use the same integration mechanism for a range of different applications without having to change the integration rules or logic. For each application, the speech grammar rules [SRGS] need to assign emma:hook to the appropriate elements in the semantic representation of the speech. The general purpose multimodal integration mechanism will use the emma:hook annotations in order to determine where to add in content from other modes. Another benefit of the emma:hook mechanism is that it facilitates interoperability among different multimodal integration components, so long as they are all general purpose and utilize emma:hook in order to determine where to integrate content.
The following provides a more detailed example of the use of the emma:hook annotation. In this example, spoken input is combined with two ink gestures. The semantic representation assigned to the spoken input "send this file to this" indicates two locations where content is required from ink input using emma:hook="ink":
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation emma:medium="acoustic" emma:mode="voice"
      emma:tokens="send this file to this"
      emma:start="1087995961500" emma:end="1087995963542">
    <command>
      <action>send</action>
      <arg1>
        <object emma:hook="ink">
          <type>file</type>
          <number>1</number>
        </object>
      </arg1>
      <arg2>
        <object emma:hook="ink">
          <number>1</number>
        </object>
      </arg2>
    </command>
  </emma:interpretation>
</emma:emma>
The user gesturing on the two locations on the display can be represented using emma:sequence:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:sequence>
    <emma:interpretation
        emma:start="1087995960500" emma:end="1087995960900"
        emma:medium="tactile" emma:mode="ink">
      <object>
        <type>file</type>
        <number>1</number>
        <id>test.pdf</id>
      </object>
    </emma:interpretation>
    <emma:interpretation
        emma:start="1087995961000" emma:end="1087995961100"
        emma:medium="tactile" emma:mode="ink">
      <object>
        <type>printer</type>
        <number>1</number>
        <id>lpt1</id>
      </object>
    </emma:interpretation>
  </emma:sequence>
</emma:emma>
A general purpose unification-based multimodal integration algorithm could use the emma:hook annotation as follows. It identifies the elements marked with emma:hook in document order. For each of those in turn, it attempts to unify the element with the corresponding element, in order, in the emma:sequence. Since none of the subelements conflict, the unification goes through and, as a result, we have the following EMMA for the composite result:
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation
      emma:medium="acoustic tactile" emma:mode="voice ink"
      emma:tokens="send this file to this"
      emma:process="http://example.com/myintegration.xml"
      emma:start="1087995960500" emma:end="1087995963542">
    <emma:derived-from resource="http://example.com/voice2.emma/#voice2" composite="true"/>
    <emma:derived-from resource="http://example.com/ink2.emma/#ink2" composite="true"/>
    <command>
      <action>send</action>
      <arg1>
        <object>
          <type>file</type>
          <number>1</number>
          <id>test.pdf</id>
        </object>
      </arg1>
      <arg2>
        <object>
          <type>printer</type>
          <number>1</number>
          <id>lpt1</id>
        </object>
      </arg2>
    </command>
  </emma:interpretation>
</emma:emma>
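One way to realize such a hook-driven integrator is sketched below (non-normative Python; the function name and data handling are illustrative assumptions). It pairs each emma:hook-annotated element, in document order, with the corresponding ink payload and merges their children, failing when a shared subelement such as <type> carries conflicting text:

```python
import copy
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"
HOOK = "{%s}hook" % EMMA_NS

def integrate(speech_root, ink_payloads):
    """Unify hook-annotated elements in the speech semantics with the
    ink payloads, in document order. Returns the fused semantics, or
    None when the counts differ or a shared subelement conflicts."""
    combined = copy.deepcopy(speech_root)
    hooks = [e for e in combined.iter() if e.get(HOOK) == "ink"]
    if len(hooks) != len(ink_payloads):
        return None
    for hook_elem, ink_elem in zip(hooks, ink_payloads):
        for child in ink_elem:
            match = hook_elem.find(child.tag)
            if match is None:
                hook_elem.append(copy.deepcopy(child))   # fuse new content
            elif match.text != child.text:
                return None                              # constraint violated
        del hook_elem.attrib[HOOK]   # the hook is consumed by integration
    return combined
```

Here ink_payloads would be the <object> payloads of the interpretations inside the emma:sequence above; wrapping the result in an emma:interpretation with the combined emma:mode and the emma:derived-from elements is left out of the sketch.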
This section is Informative.
The W3C Document Object Model [DOM] defines platform- and language-neutral interfaces that give programs and scripts the means to dynamically access and update the content, structure and style of documents. DOM Events defines a generic event system which allows registration of event handlers, describes event flow through a tree structure, and provides basic contextual information for each event.
This section of the EMMA specification extends the DOM Event interface for use with events that describe interpreted user input in terms of a DOM Node for an EMMA document.
// File: emma.idl
#ifndef _EMMA_IDL_
#define _EMMA_IDL_

#include "dom.idl"
#include "views.idl"
#include "events.idl"

#pragma prefix "dom.w3c.org"

module emma
{
  typedef dom::DOMString DOMString;
  typedef dom::Node Node;

  interface EMMAEvent : events::UIEvent {
    readonly attribute dom::Node node;
    void initEMMAEvent(in DOMString typeArg,
                       in boolean canBubbleArg,
                       in boolean cancelableArg,
                       in Node node);
  };
};
#endif // _EMMA_IDL_
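A host language binding might expose this interface roughly as follows. This Python analogue is purely illustrative: the IDL above defines the normative shape, and nothing below (including the event type string used in any host application) is part of the specification.

```python
class EMMAEvent:
    """Illustrative analogue of the EMMAEvent interface: a UI event
    carrying the DOM node of an EMMA document."""

    def __init__(self):
        self.type = None
        self.bubbles = False
        self.cancelable = False
        self.node = None   # readonly in the IDL: the EMMA document node

    def init_emma_event(self, type_arg, can_bubble_arg, cancelable_arg, node):
        """Mirror of initEMMAEvent(typeArg, canBubbleArg, cancelableArg, node)."""
        self.type = type_arg
        self.bubbles = can_bubble_arg
        self.cancelable = cancelable_arg
        self.node = node
```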
This section is Informative.
Since the publication of the Proposed Recommendation of the EMMA specification, the following minor editorial changes have been added to the draft.
This section is Informative.
The editors would like to recognize the contributions of the current and former members of the W3C Multimodal Interaction Group (listed in alphabetical order):