W3C

Speech Synthesis Markup Language (SSML) Version 1.0

W3C Recommendation 7 September 2004

This version:
http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/
Latest version:
http://www.w3.org/TR/speech-synthesis/
Previous version:
http://www.w3.org/TR/2004/PR-speech-synthesis-20040715/

Editors:
Daniel C. Burnett, Nuance Communications
Mark R. Walker, Intel
Andrew Hunt, ScanSoft

Please refer to the errata for this document, which may include some normative corrections.

See also translations.

Copyright © 1999 - 2004 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark, document use rules apply.


Abstract

The Voice Browser Working Group has sought to develop standards to enable access to the Web using spoken interaction. The Speech Synthesis Markup Language Specification is one of these standards and is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document contains the Speech Synthesis Markup Language (SSML) 1.0 specification and is a W3C Recommendation. It has been produced as part of the Voice Browser Activity. The authors of this document are participants in the Voice Browser Working Group (W3C members only). For more information see the Voice Browser FAQ. This is a stable document and has been endorsed by the W3C Membership and the participants of the Voice Browser working group.

The design of SSML 1.0 has been widely reviewed (see the disposition of comments) and satisfies the Working Group's technical requirements. A list of implementations is included in the SSML 1.0 Implementation Report, along with the associated test suite.

Comments are welcome on www-voice@w3.org (archive). See W3C mailing list and archive usage guidelines.

Patent disclosures relevant to this specification may be found on the Working Group's patent disclosure page. This document has been produced under the 24 January 2002 CPP as amended by the W3C Patent Policy Transition Procedure. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) with respect to this specification should disclose the information in accordance with section 6 of the W3C Patent Policy.

0. Table of Contents

1. Introduction

This W3C specification is known as the Speech Synthesis Markup Language specification (SSML) and is based upon the JSGF and/or JSML specifications, which are owned by Sun Microsystems, Inc., California, U.S.A. The JSML specification can be found at [JSML].

SSML is part of a larger set of markup specifications for voice browsers developed through the open processes of the W3C. It is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to give authors of synthesizable content a standard way to control aspects of speech output such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms. A related initiative to establish a standard system for marking up text input is SABLE [SABLE], which tried to integrate many different XML-based markups for speech synthesis into a new one. The activity carried out in SABLE was also used as the main starting point for defining the Speech Synthesis Markup Requirements for Voice Markup Languages [REQS]. Since then, SABLE itself has not undergone any further development.

The intended use of SSML is to improve the quality of synthesized content. Different markup elements impact different stages of the synthesis process (see Section 1.2). The markup may be produced either automatically, for instance via XSLT or CSS3 from an XHTML document, or by human authoring. Markup may be present within a complete SSML document (see Section 2.2.2) or as part of a fragment (see Section 2.2.1) embedded in another language, although no interactions with other languages are specified as part of SSML itself. Most of the markup included in SSML is suitable for use by the majority of content developers; however, some advanced features like phoneme and prosody (e.g. for speech contour design) may require specialized knowledge.

1.1 Design Concepts

The design and standardization process has followed from the Speech Synthesis Markup Requirements for Voice Markup Languages [REQS].

The following items were the key design criteria.

1.2 Speech Synthesis Process Steps

A Text-To-Speech system (a synthesis processor) that supports SSML will be responsible for rendering a document as spoken output and for using the information contained in the markup to render the document as intended by the author.

Document creation: A text document provided as input to the synthesis processor may be produced automatically, by human authoring, or through a combination of these forms. SSML defines the form of the document.

Document processing: The following are the six major processing steps undertaken by a synthesis processor to convert marked-up text input into automatically generated voice output. The markup language is designed to be sufficiently rich so as to allow control over each of the steps described below so that the document author (human or machine) can control the final voice output. Although each step below is divided into "markup support" and "non-markup behavior", actual behavior is usually a mix of the two and varies depending on the tag. The processor has the ultimate authority to ensure that what it produces is pronounceable (and ideally intelligible). In general the markup provides a way for the author to make prosodic and other information available to the processor, typically information the processor would be unable to acquire on its own. It is then up to the processor to determine whether and in what way to use the information.

  1. XML parse: An XML parser is used to extract the document tree and content from the incoming text document. The structure, tags and attributes obtained in this step influence each of the following steps. Tokens (words) in SSML cannot span markup tags. A simple English example is "cup<break/>board"; the synthesis processor will treat this as the two words "cup" and "board" rather than as one word with a pause in the middle. Breaking one token into multiple tokens this way will likely affect how the processor treats it (see the example following this list).

  2. Structure analysis: The structure of a document influences the way in which a document should be read. For example, there are common speaking patterns associated with paragraphs and sentences.

  3. Text normalization: All written languages have special constructs that require a conversion of the written form (orthographic form) into the spoken form. Text normalization is an automated process of the synthesis processor that performs this conversion. For example, for English, when "$200" appears in a document it may be spoken as "two hundred dollars". Similarly, "1/2" may be spoken as "half", "January second", "February first", "one of two" and so on. By the end of this step the text to be spoken has been converted completely into tokens. The exact details of what constitutes a token are language-specific. In English, tokens are usually separated by white space and are typically words. For languages with different tokenization behavior, the term "word" in this specification is intended to mean an appropriately comparable unit.

  4. Text-to-phoneme conversion: Once the synthesis processor has determined the set of words to be spoken, it must derive pronunciations for each word. Word pronunciations may be conveniently described as sequences of phonemes, which are units of sound in a language that serve to distinguish one word from another. Each language (and sometimes each national or dialect variant of a language) has a specific phoneme set: e.g., most US English dialects have around 45 phonemes, Hawai'ian has between 12 and 18 (depending on who you ask), and some languages have more than 100! This conversion is made complex by a number of issues. One issue is that there are differences between written and spoken forms of a language, and these differences can lead to indeterminacy or ambiguity in the pronunciation of written words. For example, compared with their spoken form, words in Hebrew and Arabic are usually written with no vowels, or only a few vowels specified. In many languages the same written word may have many spoken forms. For example, in English, "read" may be spoken as "reed" (I will read the book) or "red" (I have read the book). Both human speakers and synthesis processors can pronounce these words correctly in context but may have difficulty without context (see "Non-markup behavior" below). Another issue is the handling of words with non-standard spellings or pronunciations. For example, an English synthesis processor will often have trouble determining how to speak some non-English-origin names, e.g. "Caius College" (pronounced "keys college") and President Tito (pronounced "sutto"), the president of the Republic of Kiribati (pronounced "kiribass").

  5. Prosody analysis: Prosody is the set of features of speech output that includes the pitch (also called intonation or melody), the timing (or rhythm), the pausing, the speaking rate, the emphasis on words and many other features. Producing human-like prosody is important for making speech sound natural and for correctly conveying the meaning of spoken language.

    While most of the elements of SSML can be considered high-level in that they provide either content to be spoken or logical descriptions of style, the break and prosody elements mentioned above operate at a later point in the process and thus must coexist both with uses of the emphasis element and with the processor's own determinations of prosodic behavior. Unless specified in the appropriate sections, details of the interactions between the processor's own determinations and those provided by the author at this level are processor-specific. Authors are encouraged not to casually or arbitrarily mix these two levels of control.

  6. Waveform production: The phonemes and prosodic information are used by the synthesis processor in the production of the audio waveform. There are many approaches to this processing step so there may be considerable processor-specific variation.
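
The following fragment is an informative sketch of the tokenization and text normalization behavior described in steps 1 and 3 above; the spoken renderings given in the comments are only one possible output.

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <!-- "cup<break/>board" is parsed as the two tokens "cup" and "board",
       not as one word with a pause in the middle. -->
  The cup<break/>board is empty.
  <!-- During text normalization "$200" may be expanded to the tokens
       "two hundred dollars". -->
  That will be $200.
</speak>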

1.3 Document Generation, Applications and Contexts

There are many classes of document creator that will produce marked-up documents to be spoken by a synthesis processor. Not all document creators (including human and machine) have access to information that can be used in all of the elements or in each of the processing steps described in the previous section. The following are some of the common cases.

The following are important instances of architectures or designs from which marked-up synthesis documents will be generated. The language design is intended to facilitate each of these approaches.

1.4 Platform-Dependent Output Behavior of SSML Content

SSML provides a standard way to specify gross properties of synthetic speech production such as pronunciation, volume, pitch, rate, etc. Exact specification of synthetic speech output behavior across disparate processors, however, is beyond the scope of this document.

Unless otherwise specified, markup values are merely indications rather than absolutes. For example, it is possible for an author to explicitly indicate the duration of a text segment and also indicate an explicit duration for a subset of that text segment. If the two durations result in a text segment that the synthesis processor cannot reasonably render, the processor is permitted to modify the durations as needed to render the text segment.
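
As an informative illustration of this point (the duration values below are invented for the sketch), a processor receiving conflicting duration requests is permitted to adjust them:

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <!-- The outer element requests the whole segment in 3 seconds, while the
       inner element alone requests 5 seconds; the processor may modify one
       or both durations as needed to render the segment. -->
  <prosody duration="3s">
    The value of the fund
    <prosody duration="5s">rose by fifteen percent</prosody>
    last quarter.
  </prosody>
</speak>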

1.5 Terminology


Requirements terms
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119]. However, for readability, these words do not appear in all uppercase letters in this specification.

At user option
A conforming synthesis processor may or must (depending on the modal verb in the sentence) behave as described; if it does, it must provide users a means to enable or disable the behavior described.

Error
Results are undefined. A conforming synthesis processor may detect and report an error and may recover from it.

Media Type
A media type (defined in [RFC2045] and [RFC2046]) specifies the nature of a linked resource. Media types are case insensitive. A list of registered media types is available for download [TYPES]. See Appendix C for information on media types for SSML.

Speech Synthesis
The process of automatic generation of speech output from data input which may include plain text, marked up text or binary objects.

Synthesis Processor
A Text-To-Speech system that accepts SSML documents as input and renders them as spoken output.

Text-To-Speech
The process of automatic generation of speech output from text or annotated text input.

URI: Uniform Resource Identifier
A URI is a unifying syntax for the expression of names and addresses of objects on the network as used in the World Wide Web. A URI is defined as any legal anyURI primitive as defined in XML Schema Part 2: Datatypes [SCHEMA2 §3.2.17]. For informational purposes only, [RFC2396] and [RFC2732] may be useful in understanding the structure, format, and use of URIs. Any relative URI reference must be resolved according to the rules given in Section 3.1.3.1. In this specification URIs are provided as attributes to elements, for example in the audio and lexicon elements.

Voice Browser
A device which interprets a (voice) markup language and is capable of generating voice output and/or interpreting voice input, and possibly other input/output modalities.

2. SSML Documents

2.1 Document Form

A legal stand-alone Speech Synthesis Markup Language document must have a legal XML Prolog [XML §2.8]. If present, the optional DOCTYPE must read as follows:

<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"  "http://www.w3.org/TR/speech-synthesis/synthesis.dtd">

The XML prolog is followed by the root speak element. See Section 3.1.1 for details on this element.

The speak element must designate the SSML namespace. This can be achieved by declaring an xmlns attribute or an attribute with an "xmlns" prefix. See [XMLNS §2] for details. Note that when the xmlns attribute is used alone, it sets the default namespace for the element on which it appears and for any child elements. The namespace for SSML is defined to be http://www.w3.org/2001/10/synthesis.

It is recommended that the speak element also indicate the location of the SSML schema (see Appendix D) via the xsi:schemaLocation attribute from [SCHEMA1 §2.6.3]. Although such indication is not required, to encourage it this document provides such indication on all of the examples.

The following are two examples of legal SSML headers:

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">

<?xml version="1.0"?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
                 "http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">

The meta, metadata and lexicon elements must occur before all other elements and text contained within the root speak element. There are no other ordering constraints on the elements in this specification.
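
For illustration only (the URIs are placeholders reused from later examples), a document obeying this ordering constraint places its lexicon, meta and metadata elements before any other content:

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <lexicon uri="http://www.example.com/lexicon.file"/>
  <meta name="seeAlso" content="http://example.com/my-ssml-metadata.xml"/>
  <metadata>
    <!-- metadata content goes here -->
  </metadata>
  <p>The spoken content follows the lexicon, meta and metadata elements.</p>
</speak>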

2.2 Conformance

2.2.1 Conforming Speech Synthesis Markup Language Fragments

A document fragment is a Conforming Speech Synthesis Markup Language Fragment if:

2.2.2 Conforming Stand-Alone Speech Synthesis Markup Language Documents

A document is a Conforming Stand-Alone Speech Synthesis Markup Language Document if it meets both the following conditions:

The SSML specification and these conformance criteria provide no designated size limits on any aspect of synthesis documents. There are no maximum values on the number of elements, the amount of character data, or the number of characters in attribute values.

2.2.3 Using SSML with other Namespaces

The synthesis namespace may be used with other XML namespaces as per the Namespaces in XML Recommendation [XMLNS]. Future work by W3C is expected to address ways to specify conformance for documents involving multiple namespaces.

2.2.4 Conforming Speech Synthesis Markup Language Processors

A Speech Synthesis Markup Language processor is a program that can parse and process Conforming Stand-Alone Speech Synthesis Markup Language documents.

In a Conforming Speech Synthesis Markup Language Processor, the XML parser must be able to parse and process all XML constructs defined by XML 1.0 [XML] and Namespaces in XML [XMLNS]. This XML parser is not required to perform validation of an SSML document as per its schema or DTD; this implies that during processing of an SSML document it is optional to apply or expand external entity references defined in an external DTD.

A Conforming Speech Synthesis Markup Language Processor must correctly understand and apply the semantics of each markup element as described by this document.

A Conforming Speech Synthesis Markup Language Processor must meet the following requirements for handling of natural (human) languages:

When a Conforming Speech Synthesis Markup Language Processor encounters elements or attributes, other than xml:lang and xml:base, in a non-synthesis namespace it may:

There is, however, no conformance requirement with respect to performance characteristics of the Speech Synthesis Markup Language Processor. For instance, no statement is required regarding the accuracy, speed or other characteristics of speech produced by the processor. No statement is made regarding the size of input that a Speech Synthesis Markup Language Processor must support.

2.2.5 Conforming User Agent

A Conforming User Agent is a Conforming Speech Synthesis Markup Language Processor that is capable of accepting an SSML document as input and producing a spoken output by using the information contained in the markup to render the document as intended by the author. A Conforming User Agent must support at least one natural language.

Since the output cannot be guaranteed to be a correct representation of all the markup contained in the input there is no conformance requirement regarding accuracy. A conformance test may, however, require some examples of correct synthesis of a reference document to determine conformance.

2.3 Integration With Other Markup Languages

2.3.1 SMIL

The Synchronized Multimedia Integration Language (SMIL, pronounced "smile") [SMIL] enables simple authoring of interactive audiovisual presentations. SMIL is typically used for "rich media"/multimedia presentations which integrate streaming audio and video with images, text or any other media type. SMIL is an easy-to-learn HTML-like language, and many SMIL presentations are written using a simple text editor. See the SMIL/SSML integration examples in Appendix F.

2.3.2 ACSS

Aural Cascading Style Sheets [CSS2 §19] are employed to augment standard visual forms of documents (like HTML) with additional elements that assist in the synthesis of the text into audio. In comparison to SSML, ACSS-generated documents are capable of more complex specifications of the audio sequence, including the designation of 3D location of the audio source. Many of the other ACSS elements overlap SSML functionality, especially in the specification of voice type/quality. SSML may be viewed as a superset of ACSS capabilities, excepting spatial audio.

2.3.3 VoiceXML

The Voice Extensible Markup Language [VXML] enables Web-based development and content-delivery for interactive voice response applications (see voice browser). VoiceXML supports speech synthesis, recording and playback of digitized audio, speech recognition, DTMF input, telephony call control, and form-driven mixed initiative dialogs. VoiceXML 2.0 extends SSML for the markup of text to be synthesized. For an example of the integration between VoiceXML and SSML see Appendix F.

2.4 Fetching SSML Documents

The fetching and caching behavior of SSML documents is defined by the environment in which the synthesis processor operates. In a VoiceXML interpreter context for example, the caching policy is determined by the VoiceXML interpreter.

3. Elements and Attributes

The following elements and attributes are defined in this specification.

3.1 Document Structure, Text Processing and Pronunciation

3.1.1 speak Root Element

The Speech Synthesis Markup Language is an XML application. The root element is speak. xml:lang is a required attribute specifying the language of the root document. xml:base is an optional attribute specifying the base URI of the root document. The version attribute is a required attribute that indicates the version of the specification to be used for the document and must have the value "1.0".

<?xml version="1.0"?>
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  ... the body ...
</speak>

The speak element can only contain text to be rendered and the following elements: audio, break, emphasis, lexicon, mark, meta, metadata, p, phoneme, prosody, say-as, sub, s, voice.

3.1.2 Language: xml:lang Attribute

The xml:lang attribute, as defined by XML 1.0 [XML §2.12], can be used in SSML to indicate the natural language of the enclosing element and its attributes and subelements. RFC 3066 [RFC3066] may be of some use in understanding how to use this attribute.

Language information is inherited down the document hierarchy, i.e. it has to be given only once if the whole document is in one language, and language information nests, i.e. inner attributes overwrite outer attributes.

xml:lang is a defined attribute for the voice, speak, p, and s elements. For vocal rendering, a language change can have an effect on various other parameters (including gender, speed, age, pitch, etc.) which may be disruptive to the listener. There might even be unnatural breaks between language shifts. For this reason authors are encouraged to use the voice element to change the language. xml:lang is permitted on p and s only because it is common to change the language at those levels.

Although this attribute is also permitted on the desc element, none of the voice-change behavior described in this section applies when used with that element.

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <p>I don't speak Japanese.</p>
  <p xml:lang="ja">日本語が分かりません。</p>
</speak>

In the case that a document requires speech output in a language not supported by the processor, the synthesis processor largely determines behavior. Specifying xml:lang does not imply a change in voice, though this may indeed occur. When a given voice is unable to speak content in the indicated language, a new voice may be selected by the processor. No change in the voice or prosody should occur if the xml:lang value is the same as the inherited value. Further information about voice selection appears in Section 3.2.1.

There may be variation across conforming processors in the implementation of xml:lang voice changes for different markup elements (e.g. p and s elements).

All elements should process their contents specific to the enclosing language. For instance, the phoneme, emphasis, break, p and s elements should each be rendered in a manner that is appropriate to the current language.

The text normalization processing step may be affected by the enclosing language. This is true for both markup support by the say-as element and non-markup behavior. In the following example the same text "2/1/2000" may be read as "February first two thousand" in the first sentence, following American English pronunciation rules, but as "the second of January two thousand" in the second one, which follows Italian preprocessing rules.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <s>Today, 2/1/2000.</s>
  <!-- Today, February first two thousand -->
  <s xml:lang="it">Un mese fà, 2/1/2000.</s>
  <!-- Un mese fà, il due gennaio duemila -->
  <!-- One month ago, the second of January two thousand -->
</speak>

3.1.3 Base URI: xml:base Attribute

Relative URIs are resolved according to a base URI, which may come from a variety of sources. The base URI declaration allows authors to specify a document's base URI explicitly. See Section 3.1.3.1 for details on the resolution of relative URIs.

The base URI declaration is permitted but optional. The two elements affected by it are:

audio
The optional src attribute can specify a relative URI.
lexicon
The uri attribute can specify a relative URI.

The xml:base attribute

The base URI declaration follows [XML-BASE] and is indicated by an xml:base attribute on the root speak element.

<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:base="http://www.example.com/base-file-path">

<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:base="http://www.example.com/another-base-file-path">

3.1.3.1 Resolving Relative URIs

User agents must calculate the base URI for resolving relative URIs according to [RFC2396]. The following describes how RFC2396 applies to synthesis documents; an informative example follows the precedence list below.

User agents must calculate the base URI according to the following precedences (highest priority to lowest):

  1. The base URI is set by the xml:base attribute on the speak element (see Section 3.1.3).
  2. The base URI is given by metadata discovered during a protocol interaction, such as an HTTP header (see [RFC2616]).
  3. By default, the base URI is that of the current document. Not all synthesis documents have a base URI (e.g., a valid synthesis document may appear in an email and may not be designated by a URI). It is an error if such documents contain relative URIs.
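
As an informative sketch of rule 1 above (the URIs are placeholders), an xml:base value on speak determines how a relative audio src is resolved:

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:base="http://www.example.com/media/"
       xml:lang="en-US">
  <!-- The relative URI "welcome.wav" resolves against the xml:base value
       to "http://www.example.com/media/welcome.wav". -->
  <audio src="welcome.wav">Welcome.</audio>
</speak>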

3.1.4 Pronunciation Lexicon: lexicon Element

An SSML document may reference one or more external pronunciation lexicon documents. A lexicon document is identified by a URI with an optional media type. No standard lexicon media type has yet been defined as the default for this specification.

The W3C Voice Browser Working Group is developing the Pronunciation Lexicon Markup Language [LEX]. The specification is expected to address the matching process between tokens and lexicon entries and the mechanism by which a synthesis processor handles multiple pronunciations from internal and synthesis-specified lexicons. Pronunciation handling with proprietary lexicon formats will necessarily be specific to the synthesis processor.

A lexicon document contains pronunciation information for tokens that can appear in a text to be spoken. The pronunciation information contained within a lexicon is used for tokens appearing within the referencing document.

Pronunciation lexicons are necessarily language-specific. Pronunciation lookup in a lexicon and pronunciation inference for any token may use an algorithm that is language-specific. As mentioned in Section 1.2, the definition of what constitutes a "token" may itself be language-specific.

When multiple lexicons are referenced, their precedence goes from lower to higher with document order. Precedence means that a token is first looked up in the lexicon with highest precedence. Only if it is not found in that lexicon is the next lexicon searched, and so on until a first match is found or until all lexicons have been used for lookup.
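
The following informative fragment (with placeholder URIs) illustrates the lookup order just described:

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <!-- Document order: general.file is listed first and special-terms.file
       second, so special-terms.file has the higher precedence. Each token
       is looked up in special-terms.file first and, only if it is not
       found there, in general.file. -->
  <lexicon uri="http://www.example.com/general.file"/>
  <lexicon uri="http://www.example.com/special-terms.file"/>
  ...
</speak>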

The lexicon element

Any number of lexicon elements may occur as immediate children of the speak element. The lexicon element must have a uri attribute specifying a URI that identifies the location of the pronunciation lexicon document.

The lexicon element may have a type attribute that specifies the media type of the pronunciation lexicon document.

<?xml version="1.0"?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
                 "http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <lexicon uri="http://www.example.com/lexicon.file"/>
  <lexicon uri="http://www.example.com/strange-words.file"
           type="media-type"/>
  ...
</speak>

Details of the type attribute

Note: the description and table that follow use an imaginary vendor-specific lexicon type of x-vnd.example.lexicon. This is intended to represent whatever format is returned/available, as appropriate.

A lexicon resource indicated by a URI reference may be available in one or more media types. The SSML author can specify the preferred media type via the type attribute. When the content represented by a URI is available in many data formats, a synthesis processor may use the preferred type to influence which of the multiple formats is used. For instance, on a server implementing HTTP content negotiation, the processor may use the type to order the preferences in the negotiation.

Upon delivery, the resource indicated by a URI reference may be considered in terms of two types. The declared media type is the alleged value for the resource and the actual media type is the true format of its content. The actual type should be the same as the declared type, but this is not always the case (e.g. a misconfigured HTTP server might return text/plain for a document following the vendor-specific x-vnd.example.lexicon format). A specific URI scheme may require that the resource owner always, sometimes, or never return a media type. Whenever a type is returned, it is treated as authoritative. The declared media type is determined by the value returned by the resource owner or, if none is returned, by the preferred media type given in the SSML document.

Three special cases may arise. The declared type may not be supported by the processor; this is an error. The declared type may be supported but the actual type may not match; this is also an error. Finally, no media type may be declared; the behavior depends on the specific URI scheme and the capabilities of the synthesis processor. For instance, HTTP 1.1 allows document introspection (see [RFC2616 §7.2.1]), the data scheme falls back to a default media type, and local file access defines no guidelines. The following table provides some informative examples:

Media type examples

|                                                             | HTTP 1.1 request | HTTP 1.1 request      | Local file access     | Local file access |
| Media type returned by the resource owner                   | text/plain       | x-vnd.example.lexicon | <none>                | <none>            |
| Preferred media type from the SSML document                 | Not applicable; the returned type is authoritative. | Not applicable; the returned type is authoritative. | x-vnd.example.lexicon | <none> |
| Declared media type                                         | text/plain       | x-vnd.example.lexicon | x-vnd.example.lexicon | <none>            |
| Behavior for an actual media type of x-vnd.example.lexicon  | The document must be processed as text/plain. This will generate an error if text/plain is not supported or if the document does not follow the expected format. | The declared and actual types match; success if x-vnd.example.lexicon is supported by the synthesis processor; otherwise an error. | The declared and actual types match; success if x-vnd.example.lexicon is supported by the synthesis processor; otherwise an error. | Scheme specific; the synthesis processor might introspect the document to determine the type. |

The lexicon element is an empty element.

3.1.5 meta Element

The metadata and meta elements are containers in which information about the document can be placed. The metadata element provides more general and powerful treatment of metadata information than meta by using a metadata schema.

A meta declaration associates a string to a declared meta property or declares "http-equiv" content. Either a name or http-equiv attribute is required. It is an error to provide both name and http-equiv attributes. A content attribute is required. The seeAlso property is the only defined meta property name. It is used to specify a resource that might provide additional metadata information about the content. This property is modelled on the seeAlso property of Resource Description Framework (RDF) Schema Specification 1.0 [RDF-SCHEMA §5.4.1]. The http-equiv attribute has a special significance when documents are retrieved via HTTP. Although the preferred method of providing HTTP header information is by using HTTP header fields, the "http-equiv" content may be used in situations where the SSML document author is unable to configure HTTP header fields associated with their document on the origin server, for example, cache control information. Note that HTTP servers and caches are not required to introspect the contents of meta in SSML documents and thereby override the header values they would send otherwise.

Informative: This is an example of how meta elements can be included in an SSML document to specify a resource that provides additional metadata information and also indicate that the document must not be cached.

<?xml version="1.0"?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
                 "http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <meta name="seeAlso" content="http://example.com/my-ssml-metadata.xml"/>
  <meta http-equiv="Cache-Control" content="no-cache"/>
</speak>

The meta element is an empty element.

3.1.6 metadata Element

The metadata element is a container in which information about the document can be placed using a metadata schema. Although any metadata schema can be used with metadata, it is recommended that the XML syntax of the Resource Description Framework (RDF) [RDF-XMLSYNTAX] be used in conjunction with the general metadata properties defined in the Dublin Core Metadata Initiative [DC].

The Resource Description Framework [RDF] is a declarative language and provides a standard way for using XML to represent metadata in the form of statements about properties and relationships of items on the Web. Content creators should refer to W3C metadata Recommendations [RDF-XMLSYNTAX] and [RDF-SCHEMA] when deciding which metadata RDF schema to use in their documents. Content creators should also refer to the Dublin Core Metadata Initiative [DC], which is a set of generally applicable core metadata properties (e.g., Title, Creator, Subject, Description, Rights, etc.).

Document properties declared with the metadata element can use any metadata schema.

Informative: This is an example of how metadata can be included in an SSML document using the Dublin Core version 1.0 RDF schema [DC] describing general document information such as title, description, date, and so on:

<?xml version="1.0"?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
                 "http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">

 <metadata>
  <rdf:RDF
      xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:rdfs = "http://www.w3.org/2000/01/rdf-schema#"
      xmlns:dc = "http://purl.org/dc/elements/1.1/">

   <!-- Metadata about the synthesis document -->
   <rdf:Description rdf:about="http://www.example.com/meta.ssml"
       dc:Title="Hamlet-like Soliloquy"
       dc:Description="Aldine's Soliloquy in the style of Hamlet"
       dc:Publisher="W3C"
       dc:Language="en-US"
       dc:Date="2002-11-29"
       dc:Rights="Copyright 2002 Aldine Turnbet"
       dc:Format="application/ssml+xml" >
       <dc:Creator>
          <rdf:Seq>
             <rdf:li>William Shakespeare</rdf:li>
             <rdf:li>Aldine Turnbet</rdf:li>
          </rdf:Seq>
       </dc:Creator>
   </rdf:Description>
  </rdf:RDF>
 </metadata>
</speak>

The metadata element can have arbitrary content, although none of the content will be rendered by the synthesis processor.

3.1.7 Text Structure: p and s Elements

A p element represents a paragraph. An s element represents a sentence.

xml:lang is a defined attribute on the p and s elements.

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <p>
    <s>This is the first sentence of the paragraph.</s>
    <s>Here's another sentence.</s>
  </p>
</speak>

The use of p and s elements is optional. Where text occurs without an enclosing p or s element the synthesis processor should attempt to determine the structure using language-specific knowledge of the format of plain text.

The p element can only contain text to be rendered and the following elements: audio, break, emphasis, mark, phoneme, prosody, say-as, sub, s, voice.

The s element can only contain text to be rendered and the following elements: audio, break, emphasis, mark, phoneme, prosody, say-as, sub, voice.

3.1.8 say-as Element

The say-as element allows the author to indicate information on the type of text construct contained within the element and to help specify the level of detail for rendering the contained text.

Defining a comprehensive set of text format types is difficult because of the variety of languages that have to be considered and because of the innate flexibility of written languages. SSML only specifies the say-as element, its attributes, and their purpose. It does not enumerate the possible values for the attributes. The Working Group expects to produce a separate document that will define standard values and associated normative behavior for these values. Examples given here are only for illustrating the purpose of the element and the attributes.

The say-as element has three attributes: interpret-as, format, and detail. The interpret-as attribute is always required; the other two attributes are optional. The legal values for the format attribute depend on the value of the interpret-as attribute.

The say-as element can only contain text to be rendered.

The interpret-as and format attributes

The interpret-as attribute indicates the content type of the contained text construct. Specifying the content type helps the synthesis processor to distinguish and interpret text constructs that may be rendered in different ways depending on what type of information is intended. In addition, the optional format attribute can give further hints on the precise formatting of the contained text for content types that may have ambiguous formats.

When specified, the interpret-as and format values are to be interpreted by the synthesis processor as hints provided by the markup document author to aid text normalization and pronunciation.

In all cases, the text enclosed by any say-as element is intended to be a standard, orthographic form of the language currently in context. A synthesis processor should be able to support the common, orthographic forms of the specified language for every content type that it supports.

When the value for the interpret-as attribute is unknown or unsupported by a processor, it must render the contained text as if no interpret-as value were specified.

When the value for the format attribute is unknown or unsupported by a processor, it must render the contained text as if no format value were specified, and should render it using the interpret-as value that is specified.

When the content of the say-as element contains additional text next to the content that is in the indicated format and interpret-as type, then this additional text must be rendered. The processor may make the rendering of the additional text dependent on the interpret-as type of the element in which it appears.
When the content of the say-as element contains no content in the indicated interpret-as type or format, the processor must render the content either as if the format attribute were not present, or as if the interpret-as attribute were not present, or as if neither the format nor interpret-as attributes were present. The processor should also notify the environment of the mismatch.

Indicating the content type or format does not necessarily affect the way the information is pronounced. A synthesis processor should pronounce the contained text in a manner in which such content is normally produced for the language.
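
Because this specification does not enumerate interpret-as or format values, the following fragment is purely illustrative; the values "date" and "mdy" are hypothetical placeholders for whatever values a processor or a future specification defines.

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <!-- A hint that "2/1/2000" is a date in month-day-year order. A processor
       that does not know these values must render the text as if the
       unsupported attribute had not been specified. -->
  <say-as interpret-as="date" format="mdy">2/1/2000</say-as>
</speak>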

The detail attribute

The detail attribute is an optional attribute that indicates the level of detail to be read aloud or rendered. Every value of the detail attribute must render all of the informational content in the contained text; however, specific values for the detail attribute can be used to render content that is not usually informational in running text but may be important to render for specific purposes. For example, a synthesis processor will usually render punctuation through appropriate changes in prosody. Setting a higher level of detail may be used to speak punctuation explicitly, e.g. for reading out coded part numbers or pieces of software code.

The detail attribute can be used for all interpret-as types.

If the detail attribute is not specified, the level of detail that is produced by the synthesis processor depends on the text content and the language.

When the value for the detail attribute is unknown or unsupported by a processor, it must render the contained text as if no value were specified for the detail attribute.

3.1.9 phoneme Element

The phoneme element provides a phonemic/phonetic pronunciation for the contained text. The phoneme element may be empty. However, it is recommended that the element contain human-readable text that can be used for non-spoken rendering of the document. For example, the content may be displayed visually for users with hearing impairments.

The ph attribute is a required attribute that specifies the phoneme/phone string.

This element is designed strictly for phonemic and phonetic notations and is intended to be used to provide pronunciations for words or very short phrases. The phonemic/phonetic string does not undergo text normalization and is not treated as a token for lookup in the lexicon (see Section 3.1.4), while values in say-as and sub may undergo both. Briefly, phonemic strings consist of phonemes, language-dependent speech units that characterize linguistically significant differences in the language; loosely, phonemes represent all the sounds needed to distinguish one word from another in a given language. On the other hand, phonetic strings consist of phones, speech units that characterize the manner (puff of air, click, vocalized, etc.) and place (front, middle, back, etc.) of articulation within the human vocal tract and are thus independent of language; phones represent realized distinctions in human speech production.

The alphabet attribute is an optional attribute that specifies the phonemic/phonetic alphabet. An alphabet in this context refers to a collection of symbols to represent the sounds of one or more human languages. The only valid values for this attribute are "ipa" (see the next paragraph) and vendor-defined strings of the form "x-organization" or "x-organization-alphabet". For example, the Japan Electronics and Information Technology Industries Association [JEITA] might wish to encourage the use of an alphabet such as "x-JEITA" or "x-JEITA-2000" for their phoneme alphabet [JEIDAALPHABET].

Synthesis processors should support a value for alphabet of "ipa", corresponding to Unicode representations of the phonetic characters developed by the International Phonetic Association [IPA]. In addition to an exhaustive set of vowel and consonant symbols, this character set supports a syllable delimiter, numerous diacritics, stress symbols, lexical tone symbols, intonational markers and more. For this alphabet, legal ph values are strings of the values specified in Appendix 2 of [IPAHNDBK]. Informative tables of the IPA-to-Unicode mappings can be found at [IPAUNICODE1] and [IPAUNICODE2]. Note that not all of the IPA characters are available in Unicode. For processors supporting this alphabet,

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <phoneme alphabet="ipa" ph="t&#x259;mei&#x325;&#x27E;ou&#x325;"> tomato </phoneme>
  <!-- This is an example of IPA using character entities -->
  <!-- Because many platform/browser/text editor combinations do not
       correctly cut and paste Unicode text, this example uses the entity
       escape versions of the IPA characters.  Normally, one would directly
       use the UTF-8 representation of these symbols: "təmei̥ɾou̥". -->
</speak>

It is an error if a value for alphabet is specified that is not known or cannot be applied by a synthesis processor. The default behavior when the alphabet attribute is left unspecified is processor-specific.

The phoneme element itself can only contain text (no elements).

3.1.10 sub Element

The sub element is employed to indicate that the text in the alias attribute value replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form. The required alias attribute specifies the string to be spoken instead of the enclosed string. The processor should apply text normalization to the alias value.

The sub element can only contain text (no elements).

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <sub alias="World Wide Web Consortium">W3C</sub>
  <!-- World Wide Web Consortium -->
</speak>

3.2 Prosody and Style

3.2.1 voice Element

The voice element is a production element that requests a change in speaking voice. Attributes are:

Although each attribute individually is optional, it is an error if no attributes are specified when the voice element is used.

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <voice gender="female">Mary had a little lamb,</voice>
  <!-- now request a different female child's voice -->
  <voice gender="female" variant="2">
  Its fleece was white as snow.
  </voice>
  <!-- processor-specific voice selection -->
  <voice name="Mike">I want to be like Mike.</voice>
</speak>

The voice element is commonly used to change the language. When there is not a voice available that exactly matches the attributes specified in the document, or there are multiple voices that match the criteria, the following voice selection algorithm must be used. There are cases in the algorithm that are ambiguous; in such cases voice selection may be processor-specific. Approximately speaking, the xml:lang attribute has the highest priority and all other attributes are equal in priority but below xml:lang. The complete algorithm is as follows (an informative example appears after the list):

  1. If a voice is available for a requested xml:lang, a synthesis processor must use it. If there are multiple such voices available, the processor should use the voice that best matches the specified values for name, variant, gender and age.
  2. If there is no voice available for the requested xml:lang, the processor should select a voice that is closest to the requested language (e.g. a variant or dialect of the same language). If there are multiple such voices available, the processor should use a voice that best matches the specified values for name, variant, gender and age.
  3. It is an error if the processor decides it does not have a voice that sufficiently matches the above criteria.
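
As an informative sketch of the algorithm above (the language and attribute values are arbitrary):

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <!-- Step 1: if any French voice is available it must be used, preferring
       one that also matches gender="female" and age="40". Step 2: if no
       French voice exists, a voice for a close variant or dialect should
       be selected. Step 3: if nothing sufficiently close exists, it is
       an error. -->
  <voice xml:lang="fr" gender="female" age="40">
    Bonjour et bienvenue.
  </voice>
</speak>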

Note that simple cases of foreign-text embedding (where a voice change is not needed or undesirable) can be done. See Appendix F for examples.

voice attributes are inherited down the tree including to within elements that change the language.

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <voice gender="female">
    Any female voice here.
    <voice age="6">
      A female child voice here.
      <p xml:lang="ja">
        <!-- A female child voice in Japanese. -->
      </p>
    </voice>
  </voice>
</speak>

Relative changes in prosodic parameters should be carried across voice changes. However, different voices have different natural defaults for pitch, speaking rate, etc. because they represent different personalities, so absolute values of the prosodic parameters may vary across changes in the voice.
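
The following informative fragment illustrates this: the relative rate change is carried across both voice changes, while the absolute rates and pitches of the two voices may still differ because their natural defaults differ.

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <prosody rate="-10%">
    <!-- Both voices speak 10% slower than their own default rates. -->
    <voice gender="female">This is the first speaker.</voice>
    <voice gender="male">And this is the second speaker.</voice>
  </prosody>
</speak>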

The quality of the output audio or voice may suffer if a change in voice is requested within a sentence.

The voice element can only contain text to be rendered and the following elements: audio, break, emphasis, mark, p, phoneme, prosody, say-as, sub, s, voice.

3.2.2 emphasis Element

The emphasis element requests that the contained text be spoken with emphasis (also referred to as prominence or stress). The synthesis processor determines how to render emphasis since the nature of emphasis differs between languages, dialects or even voices. The attributes are:

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  That is a <emphasis> big </emphasis> car!
  That is a <emphasis level="strong"> huge </emphasis>
  bank account!
</speak>

The emphasis element can only contain text to be rendered and the following elements: audio, break, emphasis, mark, phoneme, prosody, say-as, sub, voice.

3.2.3 break Element

The break element is an empty element that controls the pausing or other prosodic boundaries between words. The use of the break element between any pair of words is optional. If the element is not present between words, the synthesis processor is expected to automatically determine a break based on the linguistic context. In practice, the break element is most often used to override the typical automatic behavior of a synthesis processor. The attributes on this element are:

The strength attribute is used to indicate the prosodic strength of the break. For example, the breaks between paragraphs are typically much stronger than the breaks between words within a sentence. The synthesis processor may insert a pause as part of its implementation of the prosodic break. A pause of a specific length can also be inserted by using the time attribute.

If a break element is used with neither strength nor time attributes, a break will be produced by the processor with a prosodic strength greater than that which the processor would otherwise have used if no break element was supplied.

If both strength and time attributes are supplied, the processor will insert a break with a duration as specified by the time attribute, with other prosodic changes in the output based on the value of the strength attribute.

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  Take a deep breath <break/> then continue.
  Press 1 or wait for the tone. <break time="3s"/>
  I didn't hear you! <break strength="weak"/> Please repeat.
</speak>

3.2.4 prosody Element

The prosody element permits control of the pitch, speaking rate and volume of the speech output. The attributes, all optional, are:

Although each attribute individually is optional, it is an error if no attributes are specified when the prosody element is used. The "x-foo" attribute value names are intended to be mnemonics for "extra foo". All units ("Hz", "st") are case-sensitive. Note also that customary pitch levels and standard pitch ranges may vary significantly by language, as may the meanings of the labelled values for pitch targets and ranges.

Number

A number is a simple positive floating point value without exponentials. Legal formats are "n", "n.", ".n" and "n.n" where "n" is a sequence of one or more digits.

Relative values

Relative changes for the attributes above can be specified

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  The price of XYZ is <prosody rate="-10%">$45</prosody>
</speak>

Pitch contour

The pitch contour is defined as a set of white space-separated targets at specified time positions in the speech output. The algorithm for interpolating between the targets is processor-specific. In each pair of the form (time position, target), the first value is a percentage of the period of the contained text (a number followed by "%") and the second value is the value of the pitch attribute (a number followed by "Hz", a relative change, or a label value). Time position values outside 0% to 100% are ignored. If a pitch value is not defined for 0% or 100% then the nearest pitch target is copied. All relative values for the pitch are relative to the pitch value just before the contained text.

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <prosody contour="(0%,+20Hz) (10%,+30%) (40%,+10Hz)">
    good morning
  </prosody>
</speak>

The duration attribute takes precedence over the rate attribute. The contour attribute takes precedence over the pitch and range attributes.
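
For example (informative; the values are arbitrary), when both attributes are given the duration value governs the overall timing:

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <!-- duration takes precedence over rate, so the contained text is rendered
       in approximately 4 seconds regardless of the rate value. -->
  <prosody rate="-20%" duration="4s">
    Your confirmation number is nine three two one.
  </prosody>
</speak>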

The default value of all prosodic attributes is no change. For example, omitting the rate attribute means that the rate is the same within the element as outside.

The prosody element can only contain text to be rendered and the following elements: audio, break, emphasis, mark, p, phoneme, prosody, say-as, sub, s, voice.

Limitations

All prosodic attribute values are indicative. If a synthesis processor is unable to accurately render a document as specified (e.g., trying to set the pitch to 1 MHz or the speaking rate to 1,000,000 words per minute), it must make a best effort to continue processing by imposing a limit or a substitute for the specified, unsupported value and may inform the host environment when such limits are exceeded.

In some cases, synthesis processors may elect to ignore a given prosodic markup if the processor determines, for example, that the indicated value is redundant, improper or in error. In particular, concatenative-type synthetic speech systems that employ large acoustic units may reject prosody-modifying markup elements if they are redundant with the prosody of a given acoustic unit(s) or would otherwise result in degraded speech quality.

3.3 Other Elements

3.3.1 audio Element

The audio element supports the insertion of recorded audio files (see Appendix A for required formats) and the insertion of other audio formats in conjunction with synthesized speech output. The audio element may be empty. If the audio element is not empty then the contents should be the marked-up text to be spoken if the audio document is not available. The alternate content may include text, speech markup, desc elements, or other audio elements. The alternate content may also be used when rendering the document to non-audible output and for accessibility (see the desc element). The required attribute is src, which is the URI of a document with an appropriate MIME type.

<?xml version="1.0"?><speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"         xml:lang="en-US">                   <!-- Empty element -->  Please say your name after the tone.  <audio src="beep.wav"/>  <!-- Container element with alternative text -->  <audio src="prompt.au">What city do you want to fly from?</audio>  <audio src="welcome.wav">      <emphasis>Welcome</emphasis> to the Voice Portal.   </audio></speak>

An audio element is successfully rendered:

  1. If the referenced audio source is played, or
  2. If the synthesis processor is unable to execute #1 but the alternative content is successfully rendered, or
  3. If the processor can detect that text-only output is required and the alternative content is successfully rendered.

Deciding which conditions result in the alternative content being rendered is processor-dependent. If the audio element is not successfully rendered, a synthesis processor should continue processing and should notify the hosting environment. The processor may determine after beginning playback of an audio source that the audio cannot be played in its entirety. For example, encoding problems, network disruptions, etc. may occur. The processor may designate this either as successful or unsuccessful rendering, but it must document this behavior.

The audio element can only contain text to be rendered and the following elements: audio, break, desc, emphasis, mark, p, phoneme, prosody, say-as, sub, s, voice.

3.3.2 mark Element

A mark element is an empty element that places a marker into the text/tag sequence. It has one required attribute, name, which is of type xsd:token [SCHEMA2 §3.3.2]. The mark element can be used to reference a specific location in the text/tag sequence, and can additionally be used to insert a marker into an output stream for asynchronous notification. When processing a mark element, a synthesis processor must do one or both of the following:

  1. Inform the hosting environment with the value of the name attribute and with information allowing the platform to retrieve the corresponding position in the rendered output.
  2. When audio output of the SSML document reaches the mark, issue an event that includes the required name attribute of the element. The hosting environment defines the destination of the event.

The mark element does not affect the speech output process.

<?xml version="1.0"?><speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"         xml:lang="en-US">                 Go from <mark name="here"/> here, to <mark name="there"/> there!</speak>

3.3.3 desc Element

The desc element can only occur within the content of the audio element. When the audio source referenced in audio is not speech, e.g. audio wallpaper or sonicon punctuation, it should contain a desc element whose textual content is a description of the audio source (e.g. "door slamming"). If text-only output is being produced by the synthesis processor, the content of the desc element(s) should be rendered instead of other alternative content in audio. The optional xml:lang attribute can be used to indicate that the content of the element is in a different language from that of the content surrounding the element. Unlike all other uses of xml:lang in this document, the presence or absence of this attribute will have no effect on the output in the normal case of audio (rather than text) output.

<?xml version="1.0"?><speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"         xml:lang="en-US">                   <!-- Normal use of <desc> -->  Heads of State often make mistakes when speaking in a foreign language.  One of the most well-known examples is that of John F. Kennedy:  <audio src="ichbineinberliner.wav">If you could hear it, this would be  a recording of John F. Kennedy speaking in Berlin.    <desc>Kennedy's famous German language gaffe</desc>  </audio>  <!-- Suggesting the language of the recording -->  <!-- Although there is no requirement that a recording be in the current language       (since it might even be non-speech such as music), an author might wish to       suggest the language of the recording by marking the entire <audio> element       using <voice>.  In this case, the xml:lang attribute on <desc> can be used       to put the description back into the original language. -->  Here's the same thing again but with a different fallback:  <voice xml:lang="de-DE">    <audio src="ichbineinberliner.wav">Ich bin ein Berliner.      <desc xml:lang="en-US">Kennedy's famous German language gaffe</desc>    </audio>  </voice></speak>

The desc element can only contain descriptive text.

4. References

4.1 Normative References

[CSS2]
Cascading Style Sheets, level 2: CSS2 Specification, B. Bos, et al., Editors. World Wide Web Consortium, 12 May 1998. This version of the CSS2 Recommendation is http://www.w3.org/TR/1998/REC-CSS2-19980512/. The latest version of CSS2 is available at http://www.w3.org/TR/REC-CSS2/.
[IPAHNDBK]
Handbook of the International Phonetic Association, International Phonetic Association, Editors. Cambridge University Press, July 1999. Information on the Handbook is available at http://www.arts.gla.ac.uk/ipa/handbook.html.
[RFC1521]
MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies, N. Borenstein and N. Freed, Editors. IETF, September 1993. This RFC is available at http://www.ietf.org/rfc/rfc1521.txt.
[RFC2045]
Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies, N. Freed and N. Borenstein, Editors. IETF, November 1996. This RFC is available at http://www.ietf.org/rfc/rfc2045.txt.
[RFC2046]
Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types, N. Freed and N. Borenstein, Editors. IETF, November 1996. This RFC is available at http://www.ietf.org/rfc/rfc2046.txt.
[RFC2119]
Key words for use in RFCs to Indicate Requirement Levels, S. Bradner, Editor. IETF, March 1997. This RFC is available at http://www.ietf.org/rfc/rfc2119.txt.
[RFC2396]
Uniform Resource Identifiers (URI): Generic Syntax, T. Berners-Lee et al., Editors. IETF, August 1998. This RFC is available at http://www.ietf.org/rfc/rfc2396.txt.
[RFC3066]
Tags for the Identification of Languages, H. Alvestrand, Editor. IETF, January 2001. This RFC is available at http://www.ietf.org/rfc/rfc3066.txt.
[SCHEMA1]
XML Schema Part 1: Structures, H. S. Thompson, et al., Editors. World Wide Web Consortium, 2 May 2001. This version of the XML Schema Part 1 Recommendation is http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/. The latest version of XML Schema 1 is available at http://www.w3.org/TR/xmlschema-1/.
[SCHEMA2]
XML Schema Part 2: Datatypes, P.V. Biron and A. Malhotra, Editors. World Wide Web Consortium, 2 May 2001. This version of the XML Schema Part 2 Recommendation is http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/. The latest version of XML Schema 2 is available at http://www.w3.org/TR/xmlschema-2/.
[TYPES]
MIME Media Types, IANA. This continually-updated list of media types registered with IANA is available at http://www.iana.org/assignments/media-types/index.html.
[XML]
Extensible Markup Language (XML) 1.0 (Second Edition), T. Bray et al., Editors. World Wide Web Consortium, 6 October 2000. This version of the XML 1.0 Recommendation is http://www.w3.org/TR/2000/REC-xml-20001006. The latest version of XML 1.0 is available at http://www.w3.org/TR/REC-xml.
[XML-BASE]
XML Base, J. Marsh, Editor. World Wide Web Consortium, 27 June 2001. This version of the XML Base Recommendation is http://www.w3.org/TR/2001/REC-xmlbase-20010627/. The latest version of XML Base is available at http://www.w3.org/TR/xmlbase/.
[XMLNS]
Namespaces in XML, T. Bray et al., Editors. World Wide Web Consortium, 14 January 1999. This version of the XML Namespaces Recommendation is http://www.w3.org/TR/1999/REC-xml-names-19990114/. The latest version of XML Namespaces is available at http://www.w3.org/TR/REC-xml-names/.

4.2 Informative References

[DC]
Dublin Core Metadata Initiative. See http://dublincore.org/.
[HTML]
HTML 4.01 Specification, D. Raggett et al., Editors. World Wide Web Consortium, 24 December 1999. This version of the HTML 4 Recommendation is http://www.w3.org/TR/1999/REC-html401-19991224/. The latest version of HTML 4 is available at http://www.w3.org/TR/html4/.
[IPA]
International Phonetic Association. See http://www.arts.gla.ac.uk/ipa/ipa.html for the organization's website.
[IPAUNICODE1]
The International Phonetic Alphabet, J. Esling. This table of IPA characters in Unicode is available at http://web.uvic.ca/ling/resources/ipa/charts/unicode_ipa-chart.htm.
[IPAUNICODE2]
The International Phonetic Alphabet in Unicode, J. Wells. This table of Unicode values for IPA characters is available at http://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm.
[JEIDAALPHABET]
JEIDA-62-2000 Phoneme Alphabet. JEITA. An abstract of this document (in Japanese) is available at http://it.jeita.or.jp/document/publica/standard/summary/JEIDA-62-2000.pdf.
[JEITA]
Japan Electronics and Information Technology Industries Association. See http://www.jeita.or.jp/.
[JSML]
JSpeech Markup Language, A. Hunt, Editor. World Wide Web Consortium, 5 June 2000. Copyright ©2000 Sun Microsystems, Inc. This version of the JSML submission is http://www.w3.org/TR/2000/NOTE-jsml-20000605/. The latest W3C Note of JSML is available at http://www.w3.org/TR/jsml/.
[LEX]
Pronunciation Lexicon Markup Requirements, F. Scahill, Editor. World Wide Web Consortium, 12 March 2001. This document is a work in progress. This version of the Lexicon Requirements is http://www.w3.org/TR/2001/WD-lexicon-reqs-20010312/. The latest version of the Lexicon Requirements is available at http://www.w3.org/TR/lexicon-reqs/.
[RDF]
RDF Primer, F. Manola and E. Miller, Editors. World Wide Web Consortium, 10 February 2004. This version of the RDF Primer Recommendation is http://www.w3.org/TR/2004/REC-rdf-primer-20040210/. The latest version of the RDF Primer is available at http://www.w3.org/TR/rdf-primer/.
[RDF-XMLSYNTAX]
RDF/XML Syntax Specification, D. Beckett, Editor. World Wide Web Consortium, 10 February 2004. This version of the RDF/XML Syntax Recommendation is http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/. The latest version of the RDF/XML Syntax is available at http://www.w3.org/TR/rdf-syntax-grammar/.
[RDF-SCHEMA]
RDF Vocabulary Description Language 1.0: RDF Schema, D. Brickley and R. Guha, Editors. World Wide Web Consortium, 10 February 2004. This version of the RDF Schema Recommendation is http://www.w3.org/TR/2004/REC-rdf-schema-20040210/. The latest version of RDF Schema is available at http://www.w3.org/TR/rdf-schema/.
[REQS]
Speech Synthesis Markup Requirements for Voice Markup Languages, A. Hunt, Editor. World Wide Web Consortium, 23 December 1999. This document is a work in progress. This version of the Synthesis Requirements is http://www.w3.org/TR/1999/WD-voice-tts-reqs-19991223/. The latest version of the Synthesis Requirements is available at http://www.w3.org/TR/voice-tts-reqs/.
[RFC2616]
Hypertext Transfer Protocol -- HTTP/1.1, R. Fielding, et al., Editors. IETF, June 1999. This RFC is available at http://www.ietf.org/rfc/rfc2616.txt.
[RFC2732]
Format for Literal IPv6 Addresses in URL's, R. Hinden, et al., Editors. IETF, December 1999. This RFC is available at http://www.ietf.org/rfc/rfc2732.txt.
[SABLE]
"SABLE: A Standard for TTS Markup", Richard Sproat, et al. Proceedings of the International Conference on Spoken Language Processing, R. Mannell and J. Robert-Ribes, Editors. Causal Productions Pty Ltd (Adelaide), 1998. Vol. 5, pp. 1719-1722. Conference proceedings are available from the publisher at http://www.causalproductions.com/.
[SMIL]
Synchronized Multimedia Integration Language (SMIL 2.0), J. Ayars, et al., Editors. World Wide Web Consortium, 7 August 2001. This version of the SMIL 2.0 Recommendation is http://www.w3.org/TR/2001/REC-smil20-20010807/. The latest version of SMIL 2.0 is available at http://www.w3.org/TR/smil20/.
[UNICODE]
The Unicode Standard. The Unicode Consortium. Information about the Unicode Standard and its versions can be found at http://www.unicode.org/standard/standard.html.
[VXML]
Voice Extensible Markup Language (VoiceXML) Version 2.0, S. McGlashan, et al., Editors. World Wide Web Consortium, 16 March 2004. This version of the VoiceXML 2.0 Recommendation is http://www.w3.org/TR/2004/REC-voicexml20-20040316/. The latest version of VoiceXML 2.0 is available at http://www.w3.org/TR/voicexml20/.

5. Acknowledgments

This document was written with the participation of the following participants in the W3C Voice Browser Working Group (listed in alphabetical order):

Paolo Baggia, Loquendo
Dan Burnett, Nuance
Dave Burke, Voxpilot
Jerry Carter, Independent Consultant
Sasha Caskey, IBM
Brian Eberman, ScanSoft
Andrew Hunt, ScanSoft
Jim Larson, Intel
Bruce Lucas, IBM
Scott McGlashan, HP
T.V. Raman, IBM
Dave Raggett, W3C/Canon
Laura Ricotti, Loquendo
Richard Sproat, AT&T
Luc Van Tichelen, ScanSoft
Mark Walker, Intel
Kuansan Wang, Microsoft
Dave Wood, Microsoft

Appendix A: Audio File Formats

This appendix is normative.

SSML requires that a platform support the playing of the audio formats specified below.

Required audio formats
Audio Format                                                            Media Type
Raw (headerless) 8kHz 8-bit mono mu-law (PCM) single channel (G.711)   audio/basic (from [RFC1521])
Raw (headerless) 8kHz 8-bit mono A-law (PCM) single channel (G.711)    audio/x-alaw-basic
WAV (RIFF header) 8kHz 8-bit mono mu-law (PCM) single channel          audio/x-wav
WAV (RIFF header) 8kHz 8-bit mono A-law (PCM) single channel           audio/x-wav

The 'audio/basic' MIME type is commonly used with the 'au' header format as well as the headerless 8-bit 8kHz mu-law format. If this MIME type is specified for playing, the mu-law format must be used. For playback with the 'audio/basic' MIME type, processors must support the mu-law format and may support the 'au' format.

Appendix B: Internationalization

This appendix is normative.

SSML is an application of XML 1.0 [XML] and thus supports [UNICODE] which defines a standard universal character set.

SSML provides a mechanism for control of the spoken language via the use of the xml:lang attribute. Language changes can occur as frequently as per word, although excessive language changes can diminish the output audio quality. SSML also permits finer control over output pronunciations via the lexicon and phoneme elements, features that can help to mitigate poor quality default lexicons for languages with only minimal commercial support today.
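For example, a document might switch language for a single phrase (a minimal illustrative sketch, not drawn from the specification's own examples):

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">

  <s>The Italian phrase for good morning is <voice xml:lang="it-IT">buongiorno</voice>.</s>
</speak>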

Appendix C: MIME Types and File Suffix

This appendix is normative.

The W3C Voice Browser Working Group has applied to IETF to register a MIME type for the Speech Synthesis Markup Language. The current proposal is to use "application/ssml+xml".

The W3C Voice Browser Working Group has adopted the convention of using the ".ssml" filename suffix for Speech Synthesis Markup Language documents where speak is the root element.

Appendix D: Schema for the Speech Synthesis Markup Language

This appendix is normative.

The synthesis schema is located at http://www.w3.org/TR/speech-synthesis/synthesis.xsd.

Note: the synthesis schema includes a no-namespace core schema, located at http://www.w3.org/TR/speech-synthesis/synthesis-core.xsd, which may be used as a basis for specifying Speech Synthesis Markup Language Fragments (Sec. 2.2.1) embedded in non-synthesis namespace schemas.

Appendix E: DTD for the Speech Synthesis Markup Language

This appendix is informative.

The SSML DTD is located at http://www.w3.org/TR/speech-synthesis/synthesis.dtd.

Due to DTD limitations, the SSML DTD does not correctly express that the metadata element can contain elements from other XML namespaces.

Appendix F: Example SSML

This appendix is informative.

The following is an example of reading headers of email messages. The p and s elements are used to mark the text structure. The break element is placed before the time and has the effect of marking the time as important information for the listener to pay attention to. The prosody element is used to slow the speaking rate of the email subject so that the user has extra time to listen and write down the details.

<?xml version="1.0"?><!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"                  "http://www.w3.org/TR/speech-synthesis/synthesis.dtd"><speak version="1.0"       xmlns="http://www.w3.org/2001/10/synthesis"       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"       xml:lang="en-US">  <p>    <s>You have 4 new messages.</s>    <s>The first is from Stephanie Williams and arrived at <break/> 3:45pm.    </s>    <s>      The subject is <prosody rate="-20%">ski trip</prosody>    </s>  </p></speak>

The following example combines audio files and different spoken voices to provide information on a collection of music.

<?xml version="1.0"?><!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"                  "http://www.w3.org/TR/speech-synthesis/synthesis.dtd"><speak version="1.0"       xmlns="http://www.w3.org/2001/10/synthesis"       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"       xml:lang="en-US">  <p>    <voice gender="male">      <s>Today we preview the latest romantic music from Example.</s>      <s>Hear what the Software Reviews said about Example's newest hit.</s>    </voice>  </p>  <p>    <voice gender="female">      He sings about issues that touch us all.    </voice>  </p>  <p>    <voice gender="male">      Here's a sample.  <audio src="http://www.example.com/music.wav"/>      Would you like to buy it?    </voice>  </p></speak>

It is often the case that an author wishes to include a bit of foreign text (say, a movie title) in an application without having to switch languages (for example via the voice element). A simple way to do this is shown here. In this example the synthesis processor would render the movie name using the pronunciation rules of the container language ("en-US" in this case), similar to how a reader who doesn't know the foreign language might try to read (and pronounce) it.

<?xml version="1.0" encoding="ISO-8859-1"?><speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"         xml:lang="en-US">    The title of the movie is:  "La vita è bella"  (Life is beautiful),  which is directed by Roberto Benigni.</speak>

With some additional work the output quality can be improved tremendously either by creating a custom pronunciation in an external lexicon (see Section 3.1.4) or via the phoneme element as shown in the next example.

It is worth noting that IPA alphabet support is an optional feature and that phonemes for an external language may be rendered with some approximation (see Section 3.1.4 for details). The following example only uses phonemes common to US English.

<?xml version="1.0" encoding="ISO-8859-1"?><speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"         xml:lang="en-US">    The title of the movie is:   <phoneme alphabet="ipa"    ph="&#x2C8;l&#x251; &#x2C8;vi&#x2D0;&#x27E;&#x259; &#x2C8;&#x294;e&#x26A; &#x2C8;b&#x25B;l&#x259;">   La vita è bella </phoneme>  <!-- The IPA pronunciation isˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə -->  (Life is beautiful),   which is directed by   <phoneme alphabet="ipa"    ph="&#x279;&#x259;&#x2C8;b&#x25B;&#x2D0;&#x279;&#x27E;o&#x28A; b&#x25B;&#x2C8;ni&#x2D0;nji">   Roberto Benigni </phoneme>  <!-- The IPA pronunciation isɹəˈbɛːɹɾoʊ bɛˈniːnji -->  <!-- Note that in actual practice an author might change the     encoding to UTF-8 and directly use the Unicode characters in     the document rather than using the escapes as shown.     The escaped values are shown for ease of copying. --></speak>

SMIL Integration Example

The SMIL language [SMIL] is an XML-based multimedia control language. It is especially well suited for describing dynamic media applications that include synthetic speech output.

File 'greetings.ssml' contains the following:

<?xml version="1.0"?><!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"                  "http://www.w3.org/TR/speech-synthesis/synthesis.dtd"><speak version="1.0"       xmlns="http://www.w3.org/2001/10/synthesis"       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"       xml:lang="en-US">  <s>    <mark name="greetings"/>    <emphasis>Greetings</emphasis> from the <sub alias="World Wide Web Consortium">W3C</sub>!  </s></speak>

SMIL Example 1: W3C logo image appears, and then one second later, the speech sequence is rendered. File 'greetings.smil' contains the following:

<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <top-layout width="640" height="320">
      <region width="640" height="320"/>
    </top-layout>
  </head>
  <body>
    <par>
      <img src="http://www.w3.org/Icons/w3c_home" region="whole" begin="0s"/>
      <ref src="greetings.ssml" begin="1s"/>
    </par>
  </body>
</smil>

SMIL Example 2: W3C logo image appears, then clicking on the image causes it to disappear and the speech sequence to be rendered. File 'greetings.smil' contains the following:

<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <top-layout width="640" height="320">
      <region width="640" height="320"/>
    </top-layout>
  </head>
  <body>
    <seq>
      <img src="http://w3clogo.gif" region="whole" begin="0s" end="logo.activateEvent"/>
      <ref src="greetings.ssml"/>
    </seq>
  </body>
</smil>

VoiceXML Integration Example

The following is an example of SSML in VoiceXML (see Section 2.3.3) for voice browser applications. It is worth noting that the VoiceXML namespace includes the SSML namespace elements and attributes. See Appendix O of [VXML] for details.
<?xml version="1.0" encoding="UTF-8"?> <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"   xsi:schemaLocation="http://www.w3.org/2001/vxml    http://www.w3.org/TR/voicexml20/vxml.xsd">   <form>      <block>         <prompt>           <emphasis>Welcome</emphasis> to the Bird Seed Emporium.           <audio src="rtsp://www.birdsounds.example.com/thrush.wav"/>           We have 250 kilogram drums of thistle seed for           $299.95           plus shipping and handling this month.           <audio src="http://www.birdsounds.example.com/mourningdove.wav"/>         </prompt>      </block>   </form></vxml>

Appendix G: Summary of changes since the Candidate Recommendation

This is a list of the major changes to the specification since the Candidate Recommendation:

 
