Microsoft Speech API

From Wikipedia, the free encyclopedia
Application programming interface for Microsoft Windows
This article is about the Speech API. For other uses, see SAPI (disambiguation).

The Speech Application Programming Interface or SAPI is an API developed by Microsoft to allow the use of speech recognition and speech synthesis within Windows applications. To date, a number of versions of the API have been released, which have shipped either as part of a Speech SDK or as part of the Windows OS itself. Applications that use SAPI include Microsoft Office, Microsoft Agent and Microsoft Speech Server.

In general, all versions of the API have been designed such that a software developer can write an application to perform speech recognition and synthesis by using a standard set of interfaces, accessible from a variety of programming languages. In addition, a third-party company can produce its own Speech Recognition and Text-To-Speech engines or adapt existing engines to work with SAPI. In principle, as long as these engines conform to the defined interfaces, they can be used instead of the Microsoft-supplied engines.

In general, the Speech API is a freely redistributable component which can be shipped with any Windows application that wishes to use speech technology. Many versions (although not all) of the speech recognition and synthesis engines are also freely redistributable.

There have been two main 'families' of the Microsoft Speech API. SAPI versions 1 through 4 are all similar to each other, with extra features in each newer version. SAPI 5, however, was a completely new interface, released in 2000. Since then several sub-versions of this API have been released.

Basic architecture

The Speech API can be viewed as an interface or piece of middleware which sits between applications and speech engines (recognition and synthesis). In SAPI versions 1 to 4, applications could directly communicate with engines. The API included an abstract interface definition which applications and engines conformed to. Applications could also use simplified higher-level objects rather than directly call methods on the engines.

In SAPI 5, however, applications and engines do not directly communicate with each other. Instead, each talks to a runtime component (sapi.dll). This component implements an API that applications use, and another set of interfaces for engines.

Typically, in SAPI 5, applications issue calls through the API (for example, to load a recognition grammar, start recognition, or provide text to be synthesized). The sapi.dll runtime component interprets these commands and processes them, where necessary calling on the engine through the engine interfaces (for example, the loading of a grammar from a file is done in the runtime, but the grammar data is then passed to the recognition engine to actually use in recognition). The recognition and synthesis engines also generate events while processing (for example, to indicate that an utterance has been recognized or to indicate word boundaries in the synthesized speech). These pass in the reverse direction, from the engines, through the runtime DLL, and on to an event sink in the application.
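The call and event flow described above can be sketched as a toy model in Python. This is not the real SAPI COM API: the class and method names (ToyRuntime, ToyRecognitionEngine, load_grammar, feed_audio) are hypothetical, the "grammar" is simply one phrase per line, and "audio" is already text. The sketch only illustrates that the application talks to the runtime, the runtime talks to the engine, and events travel back the same way.

```python
from typing import Callable, List

# Hypothetical names throughout: a toy model of the SAPI 5 call/event flow,
# not the real COM interfaces.


class ToyRecognitionEngine:
    """Stands in for a speech engine; it only ever talks to the runtime."""

    def __init__(self) -> None:
        self.phrases: List[str] = []
        # The runtime replaces this callback when it attaches to the engine.
        self.event_callback: Callable[[str, str], None] = lambda kind, text: None

    def set_grammar_data(self, phrases: List[str]) -> None:
        self.phrases = list(phrases)

    def process_audio(self, heard: str) -> None:
        # A real engine would decode audio; here the "audio" is already text.
        kind = "recognition" if heard in self.phrases else "false_recognition"
        self.event_callback(kind, heard)


class ToyRuntime:
    """Plays the role of sapi.dll: sits between application and engine."""

    def __init__(self, engine: ToyRecognitionEngine) -> None:
        self._engine = engine
        self._sinks: List[Callable[[str, str], None]] = []
        engine.event_callback = self._forward_event

    def add_event_sink(self, sink: Callable[[str, str], None]) -> None:
        self._sinks.append(sink)

    def load_grammar(self, grammar_text: str) -> None:
        # Parsing happens in the runtime; only parsed data reaches the engine.
        phrases = [ln.strip() for ln in grammar_text.splitlines() if ln.strip()]
        self._engine.set_grammar_data(phrases)

    def feed_audio(self, heard: str) -> None:
        self._engine.process_audio(heard)

    def _forward_event(self, kind: str, text: str) -> None:
        # Events flow engine -> runtime -> application event sinks.
        for sink in self._sinks:
            sink(kind, text)


# Application code: it never touches the engine object directly.
received = []
runtime = ToyRuntime(ToyRecognitionEngine())
runtime.add_event_sink(lambda kind, text: received.append((kind, text)))
runtime.load_grammar("open file\nclose file\n")
runtime.feed_audio("open file")
runtime.feed_audio("delete everything")
print(received)
```

Because the application holds only a reference to the runtime, the engine class could be swapped for another one with the same two methods without changing the application code, which is the kind of engine independence the SAPI 5 design aims for.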

In addition to the actual API definition and runtime DLL, other components are shipped with all versions of SAPI to make a complete Speech Software Development Kit. The following components are among those included in most versions of the Speech SDK:

  • API definition files - in MIDL and as C or C++ header files.
  • Runtime components - e.g. sapi.dll.
  • Control Panel applet - to select and configure default speech recognizer and synthesizer.
  • Text-To-Speech engines in multiple languages.
  • Speech Recognition engines in multiple languages.
  • Redistributable components to allow developers to package the engines and runtime with their application code to produce a single installable application.
  • Sample application code.
  • Sample engines - implementations of the necessary engine interfaces but with no true speech processing which could be used as a sample for those porting an engine to SAPI.
  • Documentation.

Versions

Xuedong Huang led Microsoft's early SAPI efforts.

SAPI 1-4 API family

SAPI 1

The first version of SAPI was released in 1995, and was supported on Windows 95 and Windows NT 3.51. This version included low-level Direct Speech Recognition and Direct Text To Speech APIs which applications could use to directly control engines, as well as simplified 'higher-level' Voice Command and Voice Talk APIs.

SAPI 3

SAPI 3.0 was released in 1997. It added limited support for dictation speech recognition (discrete speech, not continuous), and additional sample applications and audio sources.

SAPI 4

SAPI 4.0 was released in 1998. This version of SAPI included the core COM API, together with C++ wrapper classes to make programming from C++ easier, and ActiveX controls to allow drag-and-drop Visual Basic development. This was shipped as part of an SDK that included recognition and synthesis engines. It also shipped (with synthesis engines only) in Windows 2000.

The main components of the SAPI 4 API (which were all available in C++, COM, and ActiveX flavors) were:

  • Voice Command - high-level objects for command & control speech recognition
  • Voice Dictation - high-level objects for continuous dictation speech recognition
  • Voice Talk - high-level objects for speech synthesis
  • Voice Telephony - objects for writing telephone speech applications
  • Direct Speech Recognition - objects for direct control of recognition engine
  • Direct Text To Speech - objects for direct control of synthesis engine
  • Audio objects - for reading from and writing to an audio device or file

SAPI 5 API family

The Speech SDK version 5.0, incorporating the SAPI 5.0 runtime, was released in 2000. This was a complete redesign from previous versions, and neither engines nor applications which used older versions of SAPI could use the new version without considerable modification.

The design of the new API included the concept of strictly separating the application and engine so all calls were routed through the runtime sapi.dll. This change was intended to make the API more 'engine-independent', preventing applications from inadvertently depending on features of a specific engine. In addition, this change was aimed at making it much easier to incorporate speech technology into an application by moving some management and initialization code into the runtime.

The new API was initially a pure COM API and could be used easily only from C/C++. Support for VB and scripting languages was added later. Operating systems from Windows 98 and NT 4.0 upwards were supported.

Major features of the API include:

  • Shared Recognizer. For desktop speech recognition applications, a recognizer object can be used that runs in a separate process (sapisvr.exe). All applications using the shared recognizer communicate with this single instance. This allows sharing of resources, removes contention for the microphone and allows for a global UI for control of all speech applications.
  • In-proc recognizer. For applications that require explicit control of the recognition process, the in-proc recognizer object can be used instead of the shared one.
  • Grammar objects. Speech grammars are used to specify the words that the recognizer is listening for. SAPI 5 defines an XML markup for specifying a grammar, as well as mechanisms to create them dynamically in code. Methods also exist for instructing the recognizer to load a built-in dictation language model.
  • Voice object. This performs speech synthesis, producing an audio stream from a text. A markup language (similar to XML, but not strictly XML) can be used for controlling the synthesis process.
  • Audio interfaces. The runtime includes objects for performing speech input from the microphone or speech output to speakers (or any sound device); as well as to and from wave files. It is also possible to write a custom audio object to stream audio to or from a non-standard location.
  • User lexicon object. This allows custom words and pronunciations to be added by a user or application. These are added to the recognition or synthesis engine's built-in lexicons.
  • Object tokens. This is a concept allowing recognition and TTS engines, audio objects, lexicons and other categories of object to be registered, enumerated and instantiated in a common way.
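The object token idea from the list above (register, enumerate, instantiate in a common way) can be illustrated with a small Python sketch. Everything here is hypothetical: the TokenRegistry and ObjectToken classes, the category name "Voices", and the attribute keys are invented for illustration and do not reproduce how SAPI actually stores or names its tokens.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Toy model only: class names, category names and attribute keys are
# hypothetical and do not mirror the real SAPI token storage.


@dataclass
class ObjectToken:
    token_id: str
    category: str
    factory: Callable[[], Any]
    attributes: Dict[str, str] = field(default_factory=dict)

    def create_instance(self) -> Any:
        """Instantiate the object this token describes."""
        return self.factory()


class TokenRegistry:
    def __init__(self) -> None:
        self._tokens: Dict[str, List[ObjectToken]] = {}

    def register(self, token: ObjectToken) -> None:
        self._tokens.setdefault(token.category, []).append(token)

    def enum_tokens(self, category: str, **required: str) -> List[ObjectToken]:
        """Return tokens in a category whose attributes match all filters."""
        return [
            t for t in self._tokens.get(category, [])
            if all(t.attributes.get(k) == v for k, v in required.items())
        ]


class ToyVoice:
    def __init__(self, name: str) -> None:
        self.name = name

    def speak(self, text: str) -> str:
        return f"{self.name} says: {text}"


registry = TokenRegistry()
registry.register(ObjectToken("voice.sam", "Voices",
                              lambda: ToyVoice("Sam"), {"Language": "en-US"}))
registry.register(ObjectToken("voice.lili", "Voices",
                              lambda: ToyVoice("Lili"), {"Language": "zh-CN"}))

# The same enumerate-then-instantiate steps would apply to any category.
english_voices = registry.enum_tokens("Voices", Language="en-US")
voice = english_voices[0].create_instance()
print(voice.speak("hello"))
```

The point of the pattern is that application code selects an object by category and attributes rather than by a hard-coded class, so engines, audio objects and lexicons can all be discovered through one mechanism.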

SAPI 5.0

This version shipped in late 2000 as part of the Speech SDK version 5.0, together with version 5.0 recognition and synthesis engines. The recognition engines supported continuous dictation and command & control and were released in U.S. English, Japanese and Simplified Chinese versions. In the U.S. English system, special acoustic models were available for children's speech and telephony speech. The synthesis engine was available in English and Chinese. This version of the API and recognition engines also shipped in Microsoft Office XP in 2001.

SAPI 5.1

This version shipped in late 2001 as part of the Speech SDK version 5.1. Automation-compliant interfaces were added to the API to allow use from Visual Basic, scripting languages such as JScript, and managed code. This version of the API and TTS engines were shipped in Windows XP. Windows XP Tablet PC Edition and Office 2003 also include this version but with a substantially improved version 6 recognition engine and Traditional Chinese.
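As an illustration of what the automation interfaces make possible, the following minimal sketch drives the SAPI 5 voice object from Python through COM automation. Python is not mentioned in the text above, and the sketch assumes a Windows machine with the third-party pywin32 package installed; the helper name speak_with_sapi is hypothetical. On any other platform, or without pywin32, the function simply returns False.

```python
import sys


def speak_with_sapi(text: str) -> bool:
    """Speak text through the SAPI 5 automation voice object, if available.

    Returns False when not running on Windows or when pywin32 is missing.
    """
    if sys.platform != "win32":
        return False
    try:
        import win32com.client  # third-party pywin32 package (assumption)
    except ImportError:
        return False
    # "SAPI.SpVoice" is the ProgID of the automation voice object.
    voice = win32com.client.Dispatch("SAPI.SpVoice")
    voice.Speak(text)  # blocks until speech finishes with default flags
    return True


if __name__ == "__main__":
    print(speak_with_sapi("Hello from the Speech API"))
```

The same late-bound pattern is what Visual Basic and JScript callers rely on: the object is created by ProgID and its methods are resolved at run time rather than through C++ headers.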

SAPI 5.2

This was a special version of the API for use only in the Microsoft Speech Server, which shipped in 2004. It added support for the SRGS and SSML mark-up languages, as well as additional server features and performance improvements. The Speech Server also shipped with the version 6 desktop recognition engine and the version 7 server recognition engine.

SAPI 5.3

This is the version of the API that ships in Windows Vista together with new recognition and synthesis engines. As Windows Speech Recognition is now integrated into the operating system, the Speech SDK and APIs are a part of the Windows SDK. SAPI 5.3 includes the following new features:

  • Support for W3C XML speech grammars for recognition and synthesis. The Speech Synthesis Markup Language (SSML) version 1.0 provides the ability to mark up voice characteristics, speed, volume, pitch, emphasis, and pronunciation.
  • The Speech Recognition Grammar Specification (SRGS) supports the definition of context-free grammars, with two limitations:
    • It does not support the use of SRGS to specify dual-tone multi-frequency (DTMF, touch-tone) grammars.
    • It does not support Augmented Backus–Naur form (ABNF).
  • Support for semantic interpretation script within grammars. SAPI 5.3 enables an SRGS grammar to be annotated with JavaScript for semantic interpretation to supplement the recognized text.
  • User-specified shortcuts in lexicons, which is the ability to add a string to the lexicon and associate it with a shortcut word. When dictating, the user can say the shortcut word and the recognizer will return the expanded string.
  • Additional functionality and ease-of-programming provided by new types.
  • Performance improvements, improved reliability, and security.
  • Version 8 of the speech recognition engine ("Microsoft Speech Recognizer")
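To make the SSML item in the list above concrete, the following Python sketch builds a small SSML 1.0 document that marks up speed, pitch, volume, emphasis and a spoken substitution. The function name build_ssml and the sample text are hypothetical; the element and attribute names (speak, prosody, emphasis, sub) come from the W3C SSML 1.0 specification. Whether a particular synthesis engine honours every element is not covered here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C SSML 1.0 specification.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(lead_text: str, emphasized: str, lang: str = "en-US") -> str:
    """Return an SSML 1.0 document as a string (illustrative helper)."""
    speak = ET.Element(
        "speak",
        {"version": "1.0", "xmlns": SSML_NS, "xml:lang": lang},
    )
    # prosody controls speaking rate, pitch and volume for its content.
    prosody = ET.SubElement(
        speak, "prosody", {"rate": "slow", "pitch": "high", "volume": "loud"}
    )
    prosody.text = lead_text + " "
    # emphasis marks the enclosed words as stressed.
    emphasis = ET.SubElement(prosody, "emphasis", {"level": "strong"})
    emphasis.text = emphasized
    emphasis.tail = ". "
    # sub replaces the written text with a spoken alias.
    sub = ET.SubElement(
        speak, "sub", {"alias": "Speech Application Programming Interface"}
    )
    sub.text = "SAPI"
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Please listen", "carefully")
print(ssml)
```

The resulting string could then be handed to any SSML-aware synthesizer; building it with an XML library rather than string concatenation keeps the markup well-formed when the text contains characters such as & or <.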

SAPI 5.4

This is an updated version of the API that ships in Windows 7.

SAPI 5 Voices

Main article: Microsoft text-to-speech voices

Microsoft Sam is a commonly shipped SAPI 5 voice. In addition, Microsoft Office XP and Office 2003 installed the L&H Michael and Michelle voices. The SAPI 5.1 SDK installs three more voices: Mike, Mary, and an additional testing voice known as Sample TTS Voice, which uses prerecorded voice recordings instead of synthesized speech. Windows Vista and 7 include Microsoft Anna, which replaces Microsoft Sam and sounds more natural and intelligible; it is also installed on Windows XP by Microsoft Streets & Trips 2006 and later versions. The Chinese versions of Vista and 7 also include a female voice named Microsoft Lili. Windows 8 and later Windows client versions include Microsoft David, Zira, and Hazel, the last of which is included by default only on Windows 8 and 8.1. These voices replaced Microsoft Anna and sound more natural and intelligible than previous voices.

Managed code Speech API

A managed code API ships as part of the .NET Framework 3.0.[1] It has similar functionality to SAPI 5 but is more suitable for use by managed code applications. The new API is available on Windows XP, Windows Server 2003, Windows Vista, and Windows Server 2008.

The existing SAPI 5 API can also be used from managed code to a limited extent by creating COM Interop code (helper code designed to assist in accessing COM interfaces and classes). This works well in some scenarios; however, the new API should provide a more seamless experience, equivalent to using any other managed code library.

However, a major obstacle to transitioning from COM Interop is that the managed implementation has subtle memory leaks, which lead to memory fragmentation and rule out use of the library in any non-trivial application. As a workaround, Microsoft has suggested using a different API, which has fewer voices.[2]

Speech functionality in Windows Vista

See also: Windows Speech Recognition

Windows Vista includes a number of new speech-related features including:

  • Speech control of the full Windows GUI and applications
  • New tutorial, microphone wizard, and UI for controlling speech recognition
  • New version of the Speech API runtime: SAPI 5.3
  • Built-in updated Speech Recognition engine (Version 8)
  • New Speech Synthesis engine and SAPI voice Microsoft Anna
  • Managed code speech API (codenamed SpeechFX)
  • Speech recognition support for 8 languages at release time: U.S. English, U.K. English, Traditional Chinese, Simplified Chinese, Japanese, Spanish, French, and German, with more languages to be released later.

Microsoft Agent, most notably, and all other Microsoft speech applications use SAPI 5.

Compatibility

The Speech API is compatible with the following operating systems:[3][4]

SAPI 5

List as of SAPI version 5.1:[3][4]

Later versions of SAPI 5 (e.g. SAPI 5.3 and above) are compatible with the following operating systems:

SAPI 4

Major applications using SAPI

References

  1. Michael Dunn. "Speech synthesis and recognition in .NET - Give applications a voice". Redmond Developer News. Retrieved 2011-11-09. Archived 14 January 2010 at the Wayback Machine.
  2. "System.Speech has a memory leak | Microsoft Connect". Connect.microsoft.com. Retrieved 2013-09-27.
  3. Microsoft Corporation. "SAPI System Requirements". MSDN. Archived from the original on 2005-08-22. Retrieved 2006-04-12.
  4. "Welcome to the Microsoft Speech SDK - Microsoft Speech SDK Documentation". documentation.help. Retrieved 2025-06-20.

Retrieved from "https://en.wikipedia.org/w/index.php?title=Microsoft_Speech_API&oldid=1308171224"