BACKGROUND
1. Field of the Invention
The present invention relates to the field of speech processing technologies and, more particularly, to a speech processing system based upon Representational State Transfer (REST) architecture that uses Web 2.0 concepts for speech resource interfaces.
2. Description of the Related Art
In the past, companies having a Web presence thrived by providing as many people broad access to as much information as possible. Information flow was unidirectional, from a company to information consumers. As time has progressed, users have become inundated with too much information from too many sources. Successful Web sites began to provide user-facing information management and information filtration mechanisms designed to aid users in identifying information of interest. Even these Web sites were somewhat flawed in a sense that information still flowed in a unidirectional manner. A user was limited to information gathered and groomed by a particular information provider.
A new type of Web application began to emerge which emphasized user interactions and two-way information exchange. These new Web applications operated more as information marketplaces where people shared information and not as information depots where users accessed a semi-static reservoir of information. This new Web and set of Web applications can be referred to as Web 2.0, where Web 2.0 signifies a second generation of Web based services and applications that emphasize online collaboration and information sharing among users. In other words, a Web 1.0 application would be one that was effectively read-only from a user perspective, where a Web 2.0 application would provide read, write, and update access to end-users. Web 2.0 users can fundamentally change a Web 2.0 application.
Specific examples of Web 2.0 instances include WIKIs, BLOGs, social networking sites, FOLKSONOMIEs, MASHUPs, and the like. All of these Web 2.0 instances allow end-users to add content which other users are able to access. A value of a Web 2.0 Web site is enhanced by the user provided content and may even be completely dependent upon it.
For example, WIKIPEDIA (e.g., one Web 2.0 application) is a WIKI based encyclopedia where each end-user is able to view, add, and edit content. No content would exist without end-user contributions. Information accuracy results from an end-user population constantly updating erroneous entries which other users provide. As new innovations emerge, customers update and add WIKIPEDIA entries that describe these new innovations. Other examples of Web 2.0 applications include MYSPACE.com, YOUTUBE.com, DEL.ICIO.US.com, CRAIGSLIST.com, and the like.
Currently, a schism exists between speech processing technologies and Web 2.0 applications, meaning that Web 2.0 instances do not generally incorporate speech processing technologies. One reason for this is that conventional interfaces to speech resources are too complex for an average end-user to utilize. For this reason, speech technologies are typically only available from Web sites/services that provide a unidirectional flow of information. For example, speech technologies are commonly used by enterprises to handle routine customer interactions via a telephone interface, such as providing bank balances and the like.
One problem contributing to the schism is that speech processing technologies are currently implemented using a non-uniform interface and the Web 2.0 is generally based upon a uniform interface. That is, speech processing operations are accessed via function calls, method invocations, remote procedure calls (RPC), and other messages that are only understood by a specific server or a small subset of components. A specific invocation mechanism and required parameters must be known by a client and must be integrated into an interface. A non-uniform interface is characteristic of RPC based techniques, which include Simple Object Access Protocol (SOAP), Common Object Request Broker Architecture (CORBA), Distributed Component Object Model (DCOM), JINI, and the like. Without deliberate integration efforts, however, the chances that two software objects designed from an unconstrained architecture will interoperate are near nil. At best, an ad hoc collection of software objects having vastly different interface requirements results from the RPC style architecture. The lack of uniform interfaces makes integrating speech processing capabilities for each RPC based application a unique endeavor fraught with application specific challenges, which usually require significant speech processing design skills to overcome.
In contrast, a uniform interface exists that includes a few basic primitive commands (e.g., GET, PUT, POST, DELETE) that act upon targets, which in a Web 2.0 context are generally able to be referenced by Uniform Resource Identifiers (URIs). A term used for this type of architecture is Representational State Transfer (REST). REST based solutions simplify component implementation, reduce the complexity of connector semantics, improve the effectiveness of performance tuning, and increase the scalability of pure server components. The Web (e.g., hypertext technologies) in general is founded upon REST principles. Web 2.0 expands these REST principles to permit end users to add (HTTP PUT), update (HTTP POST), and remove (HTTP DELETE) content. Thus, WIKIs, BLOGs, FOLKSONOMIEs, MASHUPs, and the like are all considered RESTful, since each generally follows REST principles.
What is needed to bridge the gap between speech processing resources and conventional Web 2.0 applications is a new paradigm for interfacing with speech processing resources, which makes speech processing resources more available to end-users. In this contemplated paradigm, end-users would optimally be able to cooperatively and dynamically develop speech-enabled solutions, which the end-users would then be able to integrate into Web 2.0 content. Thus, a more robust Web 2.0 environment that incorporates speech processing technologies will be allowed to evolve. This is a stark contrast with a conventional paradigm for interfacing with speech processing resources, which is decisively non-RESTful in nature.
SUMMARY OF THE INVENTION
The present invention discloses a RESTful speech processing system that uses Web 2.0 concepts for interfacing with server-side speech resources. The RESTful speech processing system can be used to add customizable speech processing capabilities to Web 2.0 instances, such as WIKIs, BLOGs, social networking sites, FOLKSONOMIEs, MASHUPs, and the like. The invention can access speech-enabled applications via introspection documents. Each speech-enabled application can contain a collection of entries and resources. The entries can include Web 2.0 entries, such as WIKI entries, and the resources can include speech resources, such as speech recognition, speech synthesis, speech identification, and voice interpreter resources. Each entry and resource can be further decomposed into sub-components specified at a lower granularity level. Each application resource/entry can be introspected, customized, replaced, added, re-ordered, and/or removed by end users.
The present invention can be implemented in accordance with numerous aspects consistent with the material presented herein. For example, one aspect of the present invention can include a speech processing system that includes a client, a speech for Web 2.0 system, and a speech processing system. The client can access a speech-enabled application using at least one Web 2.0 communication protocol. For example, a standard browser of the client can use a HyperText Transfer Protocol (HTTP) to communicate with the speech-enabled application executing on the speech for Web 2.0 system. The speech for Web 2.0 system can access a data store within which user specific speech parameters are included, wherein a user of the client is able to configure the specific speech parameters of the data store. For example, a user can configure which speech resources are available (e.g., TTS, ASR, SIV, VoiceXML interpreter, and the like), resource characteristics (language, grammar, voice gender, speaking rate, and the like), delivery characteristics (real-time or not, synchronous or not, delivery protocol, delivery codec, delivery fidelity, and the like), and other such characteristics. Suitable ones of these speech parameters are utilized whenever the user interacts with the Web 2.0 system. The speech processing system can include one or more speech processing engines. The speech processing system can interact with the speech for Web 2.0 system to handle speech processing tasks associated with the speech-enabled application.
Another aspect of the present invention can include a system for using Web 2.0 as an interface to speech engines. The system can include a Web 2.0 server and a server-side speech processing system. The Web 2.0 server can serve at least one speech-enabled application to at least one remotely located client. The server-side speech processing system can handle speech processing operations for the speech-enabled applications. Communications with the server-side speech processing system can occur via a set of RESTful commands, such as GET, PUT, POST, and DELETE.
Still another aspect of the present invention can include a speech for Web 2.0 system that includes a Web 2.0 server. The Web 2.0 server can serve at least one speech-enabled application to remotely located clients. The speech-enabled application can include an introspection document, a collection of entries, and a collection of resources. At least one of the resources can be a speech resource associated with a speech engine, which adds a speech processing capability to the speech-enabled application.
It should be noted that various aspects of the invention can be implemented as a program for controlling computing equipment to implement the functions described herein, or a program for enabling computing equipment to perform processes corresponding to the steps disclosed herein. This program may be provided by storing the program in a magnetic disk, an optical disk, a semiconductor memory, or any other recording medium. The program can also be provided as a digitally encoded signal conveyed via a carrier wave. The described program can be a single program or can be implemented as multiple subprograms, each of which interact within a single computing device or interact in a distributed fashion across a network space.
It should also be noted that the methods detailed herein can also be methods performed at least in part by a service agent and/or a machine manipulated by a service agent in response to a service request.
BRIEF DESCRIPTION OF THE DRAWINGS
There are shown in the drawings, embodiments which are presently preferred, it being understood, however, that the invention is not limited to the precise arrangements and instrumentalities shown.
FIG. 1 is a schematic diagram of a system that utilizes Web 2.0 concepts for speech processing operations in accordance with an embodiment of the inventive arrangements disclosed herein.
FIG. 2 is a schematic diagram of a system for a Web 2.0 for voice system in accordance with an embodiment of the inventive arrangements disclosed herein.
FIG. 3 is a schematic diagram showing a WIKI server adapted for communications with a Web 2.0 for voice system in accordance with an embodiment of the inventive arrangements disclosed herein.
DETAILED DESCRIPTION OF THE INVENTION
FIG. 1 is a schematic diagram of a system 100 that utilizes Web 2.0 concepts for speech processing operations in accordance with an embodiment of the inventive arrangements disclosed herein. In system 100, a user 110 can use an interface 114 of client 112 to communicate with the speech for Web 2.0 system 120, which can include a Web 2.0 server 122 and/or a RESTful server 130. When the client 112 is a basic computing device (e.g., a telephone), a middleware server 116 can provide an interface 118 to system 120. Interface 114 and/or 118 can be a Web or voice browser, which communicates directly with system 120 using Web 2.0 conventions. Applications 126, which the client 112 accesses, can be voice-enabled applications stored in data store 124. A type of browser (e.g., interface 114 and/or 118) used to access the applications 126 can be transparent to the system 120, or can be transparent at least to RESTful server 130 of system 120.
The RESTful server 130 can provide speech processing operations for applications 126 by interfacing with speech processing system 150. Communications between the Web 2.0 server 122 and the RESTful server 130 can be REST based communications, such as those conducted using the ATOM PUBLISHING PROTOCOL (APP). In one embodiment, servers 122 and 130 can be functionally integrated into a single server of speech for Web 2.0 system 120.
The RESTful server 130 can utilize a set of basic commands enabling the command engine 132 to conduct speech processing operations. The commands can be REST commands that include an HTTP GET, an HTTP POST, an HTTP PUT, and an HTTP DELETE command. The RESTful server 130 can also include an introspection/discovery engine 134 and/or a media engine 136 as well as data store 138.
Data store 138 can include a set of documents 140, such as introspection documents 142, entry collection documents 144, and resource collection documents 146. The documents 140 together can link the RESTful server 130 to speech processing engines 156 of speech processing server 150 and can control behavior of speech processing server 150. The documents 140 and resulting behavior of the speech processing server 150 can be configured by user 110 in a user-specific manner. That is, different users 110 can inject their own voice characteristics, markup, behavior, and/or other features, which the speech processing system 150 utilizes.
The Web 2.0 system 120 can be communicatively linked to one or more enterprise servers 158 having an associated data store 160. Thus, the Web 2.0 system 120 can be a communication intermediary which provides user 110 with access to information and services of the enterprise server and data store 160.
Web 2.0 system 120 can further be communicatively linked to one or more additional RESTful servers 162, each associated with a data store 164, within which a set of documents, approximately equivalent to documents 140, are stored. Communications between Web 2.0 system 120 and speech processing system 150 or RESTful server 162 can be based on a RESTful protocol, such as APP.
It should be appreciated that RESTful servers 130 and 162 are able to operate in a stateless fashion which permits RESTful server 162 to seamlessly replace functionality of server 130. That is, state information does not have to be transferred when control is transferred from one server 130 to another 162. Thus, system 100 provides a highly scalable solution (i.e., when under a heavy load, server 130 can transfer load to server 162) and can provide fault tolerance and recovery capabilities (i.e., when server 130 experiences runtime problems, a different operational server 162 can immediately perform operations previously handled by server 130).
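The stateless hand-off described above can be illustrated with a minimal Python sketch. The class names, the request fields, and the "/speech/tts" path are hypothetical and are not taken from the disclosure; the sketch only assumes that every request carries all the information needed to serve it, so that a second server can take over without any state transfer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechRequest:
    """A self-describing request: everything needed to serve it travels with it."""
    method: str          # e.g. "POST"
    uri: str             # e.g. "/speech/tts" (hypothetical path)
    body: str            # e.g. text to synthesize
    voice: str = "default"


class StatelessServer:
    """Keeps no per-client session, so any instance can serve any request."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        # The response depends only on the request, never on earlier calls.
        return f"{request.method} {request.uri} voice={request.voice} text={request.body}"


def dispatch(request, servers):
    """Try each server in order; fail over without transferring any state."""
    for server in servers:
        try:
            return server.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("no server available")
```

Because the response is a function of the request alone, a dispatcher can route the same request to a backup server and obtain the same result the failed server would have produced.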
Another point about system 100 that should be emphasized is that client 112 is able to interact with the speech-enabled application 126 using Web 2.0 communication protocols only. No special client-side speech interface is required. At the same time, the user 110 is able to customize/personalize/configure speech processing behavior at low-levels.
As used herein, Web 2.0 is a concept that refers to a cooperative Web in which end-users 110 add value by providing content, as opposed to Web systems that unidirectionally provide information from an information provider to an information consumer. In other words, Web 2.0 refers to a readable, writable, and updateable Web. While a myriad of types of Web 2.0 instances exist, some currently popular ones include WIKIs, BLOGS, MASHUPs, FOLKSONOMIEs, social networking sites, and the like.
REST refers to a Representational State Transfer architecture. A REST approach focuses on utilizing a constrained operation set, such as GET, PUT, POST, and DELETE, to act against a set of structured targets which can be URL addressable. A REST architecture is a client/server architecture which is stateless, cacheable, and layered by nature. REST replaces a paradigm of do-something with a make-something-so concept. That is, instead of attempting to execute a kind of state transition for a software object, the REST concept changes a state of a software object to a user designated state. A RESTful object (e.g., RESTful server 130, 162) is one which primarily conforms to REST concepts. A RESTful interface can be a simple interface that transmits domain-specific data using an HTTP based protocol without utilizing an additional messaging layer, such as SOAP, and without reliance on session tracking HTTP cookies.
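The difference between the do-something and make-something-so paradigms can be sketched as follows. Both classes and the "rate" attribute are hypothetical illustrations, not part of the disclosure. The first class exposes an operation that transitions state; the second accepts the desired state outright, in the manner of a PUT.

```python
class RpcStyleVoice:
    """'Do-something' style: callers invoke operations that move the state."""

    def __init__(self):
        self.rate = 100

    def speed_up(self, delta):
        # The outcome depends on how many times the call is made.
        self.rate += delta
        return self.rate


class RestStyleVoice:
    """'Make-something-so' style: callers supply the state they want to end up in."""

    def __init__(self):
        self.representation = {"rate": 100}

    def put(self, desired):
        # Replacing the representation is idempotent: repeating it changes nothing.
        self.representation = dict(desired)
        return self.representation
```

Repeating the RPC-style call twice yields a different state than calling it once, whereas repeating the same PUT leaves the object in the same user designated state.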
The client 112 can be any computing device capable of communicating with either the system 120 or middleware server 116. In one embodiment, client 112 can include a Web browser 114, which operates as an interface between the user 110 and the system 120. In another embodiment, the client 112 can be a voice communication device that communicates with the middleware server 116, which can include a voice browser 118. In these embodiments, specific instances of the client 112 can include a computer, a Web station, a media player, a telephone, a smart phone, and the like.
Web 2.0 server 120 can be a server 120 that provides Web content to interface 114 and/or 118 and which permits a user 110 to provide additional Web content, which is made available to other users. The Web 2.0 server can be a WIKI server, a BLOG server, a social networking server, a MASHUP server, a FOLKSONOMY server, and the like. In one embodiment, the Web 2.0 server 120 can be a RESTful server, in which case functionality shown for server 130 can be incorporated within server 120. Alternatively, a transformer can be included in the Web 2.0 server, which converts content between a server-specific format (e.g., a WIKI format) and a RESTful format, such as a format adhering to an APP based protocol.
RESTful server 130 and 162 can be a server adhering to REST concepts, which links the server 120 to speech processing server 150. In one embodiment, the RESTful server 130 can be an APP server. RESTful commands can be issued by command engine 132, which are received and processed by command interpreter 154. A media interface 136 of the RESTful server 130 can control caching, delivery, fidelity, and formatting of delivered media, which includes delivered speech. Delivery can be in accordance with a streaming protocol, a file based protocol, a real-time protocol, and the like.
Speech processing server 150 can be any networked server or speech processing system which is able to process speech requests using one or more speech engines 156. In one embodiment, the speech processing server 150 can be a turn-based and/or clustered system capable of handling multiple requests in real-time. For example, speech processing server 150 can be implemented as a WEBSPHERE VOICE SERVER or other such commercially available product. Management tasks of the server 150 can be handled by the management processor 152. The various speech engines 156 can include ASR, TTS, SIV, voice markup interpreters, and the like.
Data stores 124, 138, 160, and 164 can be a physical or virtual storage space configured to store digital information. Data stores 124, 138, 160, and 164 can be physically implemented within any type of hardware including, but not limited to, a magnetic disk, an optical disk, a semiconductor memory, a digitally encoded plastic memory, a holographic memory, or any other recording medium. Each of the data stores 124, 138, 160, and 164 can be a stand-alone storage unit as well as a storage unit formed from a plurality of physical devices. Additionally, information can be stored within data stores 124, 138, 160, and 164 in a variety of manners. For example, information can be stored within a database structure or can be stored within one or more files of a file storage system, where each file may or may not be indexed for information searching purposes. Further, data stores 124, 138, 160, and 164 can utilize one or more encryption mechanisms to protect stored information from unauthorized access.
The components of system 100 can be communicatively linked to each other via a network (not shown). The network can include any hardware, software, and firmware necessary to convey data encoded within carrier waves. Data can be contained within analog or digital signals and conveyed through data or voice channels. The network can include local components and data pathways necessary for communications to be exchanged among computing device components and between integrated device components and peripheral devices. The network can also include network equipment, such as routers, data lines, hubs, and intermediary servers which together form a data network, such as the Internet. The network can also include circuit-based communication components and mobile communication components, such as telephony switches, modems, cellular communication towers, and the like. The network can include line based and/or wireless communication pathways.
FIG. 2 is a schematic diagram of a system 200 for a Web 2.0 for voice system 230 in accordance with an embodiment of the inventive arrangements disclosed herein. System 200 can be an alternative representation and/or an embodiment for the system 100 of FIG. 1 or for a system that provides approximately equivalent functionality as system 100 utilizing Web 2.0 concepts to provide speech processing capabilities.
In system 200, Web 2.0 clients 240 can communicate with Web 2.0 servers 210-214 utilizing a REST/ATOM 250 protocol. The Web 2.0 servers 210-214 can serve one or more speech-enabled applications 220-224, where speech resources are provided by a Web 2.0 for Voice system 230. One or more of the applications 220-224 can include AJAX 256 or other JavaScript code. In one embodiment, the AJAX 256 code can be automatically converted from WIKI or other syntax by a transformer of a server 210-214.
Communications between the Web 2.0 servers 210-214 and system 230 can be in accordance with REST/ATOM 256 protocols. Each speech-enabled application 220-224 can be associated with an ATOM container 231, which specifies Web 2.0 items 232, resources 233, and media 234. One or more resources 233 can correspond to a speech engine 238.
The Web 2.0 clients 240 can be any client capable of interfacing with a Web 2.0 server 210-214. For example, the clients 240 can include a Web or voice browser 241 as well as any other type of interface 244, which executes upon a computing device. The computing device can include a mobile telephone 242, a mobile computer 243, a laptop, a media player, a desktop computer, a two-way radio, a line-based phone, and the like. Unlike conventional speech clients, the clients 240 need not have a speech-specific interface and instead only require a standard Web 2.0 interface. That is, there are no assumptions regarding the client 240 other than an ability to communicate with a Web 2.0 server 210-214 using Web 2.0 conventions.
The Web 2.0 servers 210-214 can be any server that provides Web 2.0 content to clients 240 and that provides speech processing capabilities through the Web 2.0 for voice system 230. The Web 2.0 servers can include a WIKI server 210, a BLOG server 212, a MASHUP server, a FOLKSONOMY server, a social networking server, and any other Web 2.0 server 214.
The Web 2.0 for voice system 230 can utilize Web 2.0 concepts to provide speech capabilities. A server-side interface is established between the voice system 230 and a set of Web 2.0 servers 210-214. Available speech resources can be introspected and discovered via introspection documents, which are one of the Web 2.0 items 232. Introspection can be in accordance with the APP specification or a similar protocol. The ability for dynamic configuration and installation is exposed to the servers 210-214 via the introspection document.
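As an illustration of introspection, the following Python sketch parses a document shaped like an APP service document and lists the collections it advertises. The namespaces are those of the Atom Publishing Protocol and Atom Syndication Format; the document content, the example.org URIs, and the collection titles are invented for illustration, and whether a given embodiment uses exactly this layout is an assumption.

```python
import xml.etree.ElementTree as ET

APP_NS = "http://www.w3.org/2007/app"
ATOM_NS = "http://www.w3.org/2005/Atom"

# Hypothetical introspection document; URIs and titles are illustrative only.
INTROSPECTION_DOC = f"""<service xmlns="{APP_NS}" xmlns:atom="{ATOM_NS}">
  <workspace>
    <atom:title>Speech-enabled application</atom:title>
    <collection href="http://example.org/app/entries">
      <atom:title>Entries</atom:title>
    </collection>
    <collection href="http://example.org/app/resources">
      <atom:title>Speech resources</atom:title>
    </collection>
  </workspace>
</service>"""


def discover_collections(document):
    """Return a mapping of collection title to collection URI."""
    root = ET.fromstring(document)
    collections = {}
    for collection in root.iter(f"{{{APP_NS}}}collection"):
        title = collection.find(f"{{{ATOM_NS}}}title")
        collections[title.text] = collection.get("href")
    return collections
```

A server could fetch such a document with a GET and then address the discovered entry and resource collections individually.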
That is, access to Web 2.0 for voice system 230 can be through a Web 2.0 server that lets users (e.g., clients 240) provide their own customizations/personalizations. Appreciably, use of the APP 256 opens up the application interface to speech resources using Web 2.0, JAVA 2 ENTERPRISE EDITION (J2EE), WEBSPHERE APPLICATION SERVER (WAS), and other conventions, rather than being restricted to protocols, such as media resource control protocol (MRCP), real time streaming protocol (RTSP), or real time protocol (RTP).
A constrained set of RESTful commands can be used to interface with the Web 2.0 for voice system 230. RESTful commands can include a GET command, a POST command, a PUT command, and a DELETE command, each of which is able to be implemented as an HTTP command. As applied to speech, GET (e.g., HTTP GET) can return capabilities and elements that are modifiable. The GET command can also be used for submitting simplistic speech queries and for receiving query results.
The POST command can create media-related resources using speech engines 238. For example, the POST command can create an audio “file” from input text using a text-to-speech (TTS) resource 233 which is linked to a TTS engine 238. The POST command can create a text representation given an audio input, using an automatic speech recognition (ASR) resource 233 which is linked to an ASR engine 238. The POST command can create a score given an audio input, using a Speaker Identification and Verification (SIV) resource which is linked to a SIV engine 238. Any type of speech processing resource can be similarly accessed using the POST command.
The PUT command can be used to update configuration of speech resources (e.g., default voice-name, ASR or TTS language, TTS voice, media destination, media delivery type, etc.). The PUT command can also be used to add a resource or capability to a Web 2.0 server 210-214 (e.g., installing an SIV component). The DELETE command can remove a speech resource from a configuration. For example, the DELETE command can be used to uninstall a previously installed speech component.
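The four commands as applied to speech can be sketched with a toy in-memory model. This is not an implementation of any speech engine: the class name, the resource names ("tts", "siv"), the configuration keys, and the "/media/N" URIs are hypothetical, and engine output is replaced by a stub record.

```python
class SpeechForWeb20:
    """Toy in-memory stand-in; real engines (TTS, ASR, SIV) are replaced by stubs."""

    def __init__(self):
        # Installed speech resources, keyed by name, each with its configuration.
        self.resources = {"tts": {"language": "en-US", "voice": "default"}}
        self.media = {}
        self._next_id = 1

    def get(self, name=None):
        """GET: list capabilities, or return one resource's modifiable settings."""
        if name is None:
            return sorted(self.resources)
        return dict(self.resources[name])

    def post(self, name, payload):
        """POST: create a media-related resource using the named speech resource."""
        config = self.resources[name]  # raises KeyError if not installed
        uri = f"/media/{self._next_id}"
        self._next_id += 1
        # Stub result: a real system would hand the payload to a speech engine.
        self.media[uri] = {"resource": name, "input": payload, "config": dict(config)}
        return uri

    def put(self, name, config):
        """PUT: update a resource's configuration, installing it if absent."""
        self.resources.setdefault(name, {}).update(config)
        return dict(self.resources[name])

    def delete(self, name):
        """DELETE: remove (uninstall) a speech resource from the configuration."""
        return self.resources.pop(name, None) is not None
```

For example, a PUT that changes the TTS voice affects the configuration recorded by every later POST to that resource, while a DELETE makes subsequent POSTs to the removed resource fail.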
The Web 2.0 for Voice system 230 is an extremely flexible solution that permits users (of clients 240) to customize numerous speech processing elements. Customizable speech processing elements can include speech resource availability, request characteristics, result characteristics, media characteristics, and the like. Speech resource availability can indicate whether a specific type of resource (e.g., ASR, TTS, SIV, Voice XML interpreter) is available. Request characteristics can refer to characteristics such as language, grammar, voice attributes, gender, rate of speech, and the like. The result characteristics can specify whether results are to be delivered synchronously or asynchronously. Result characteristics can alternatively indicate whether a listener for callback is to be supplied with results. Media characteristics can include input and output characteristics, which can vary from a URI reference to an RTP stream. The media characteristics can specify a codec (e.g., G711), a sample rate (e.g., 8 KHz to 22 KHz), and the like. In one configuration, the speech engines 238 can be provided from a J2EE environment 236, such as a WAS environment. This environment 236 can conform to a J2EE Connector Architecture (JCA) 237.
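One way to picture the customizable elements is as a small configuration structure, sketched below in Python. All field names and default values are hypothetical. Treating the 8 KHz to 22 KHz example range as a hard validation limit is an assumption made only for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class RequestCharacteristics:
    language: str = "en-US"
    grammar: str = ""
    voice_gender: str = "female"
    speech_rate: str = "medium"


@dataclass
class ResultCharacteristics:
    synchronous: bool = True
    callback_listener: str = ""  # URI of a listener for callbacks, empty if none


@dataclass
class MediaCharacteristics:
    codec: str = "G711"
    sample_rate_hz: int = 8000

    def __post_init__(self):
        # Assumption: enforce the example range given in the text as a limit.
        if not 8000 <= self.sample_rate_hz <= 22000:
            raise ValueError("sample rate outside 8 KHz to 22 KHz")


@dataclass
class SpeechProfile:
    """User-specific speech settings grouped by the categories named above."""
    available_resources: set = field(default_factory=lambda: {"ASR", "TTS"})
    request: RequestCharacteristics = field(default_factory=RequestCharacteristics)
    result: ResultCharacteristics = field(default_factory=ResultCharacteristics)
    media: MediaCharacteristics = field(default_factory=MediaCharacteristics)
```

A profile of this kind could be serialized and stored per user, then updated with a PUT whenever the user changes a setting.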
In one embodiment, a set of additional facades 260 can be utilized on top of Web 2.0 protocols to provide additional interface and protocol 262 options (e.g., MRCP, RTSP, RTP, Session Initiation Protocol (SIP), etc.) to the Web 2.0 for voice system 230. Use of facades 260 can enable legacy access/use of the Web 2.0 for voice system 230. The facades 260 can be designed to segment the protocol 262 from underlying details so that characteristics of the facade do not bleed through to speech implementation details. Functions, such as the WAS 6.1 channel framework or a JCA container, can be used to plug-in a protocol, which is not native to the J2EE environment 236. The media component 234 of the container 231 can be used to handle media storage, delivery, and format conversions as necessary. Facades 260 can be used for asynchronous or synchronous protocols 262.
FIG. 3 is a schematic diagram showing a WIKI server 330 adapted for communications with a Web 2.0 for voice system 310 in accordance with an embodiment of the inventive arrangements disclosed herein. Although a WIKI server 330 is illustrated, server 330 can be any Web 2.0 server (e.g., server 120 of system 100 or server 210-214 of system 200) including, but not limited to, a BLOG server, a MASHUP server, a FOLKSONOMY server, a social networking server, and the like.
In the system 300, a browser 320 can communicate with Web 2.0 server 330 via a Representational State Transfer (REST) architecture/ATOM 304 based protocol. The Web 2.0 server 330 can communicate with a speech for Web 2.0 system 310 via a REST/ATOM 302 based protocol. Protocols 302, 304 can include HTTP and similar protocols that are RESTful by nature as well as an Atom Publishing Protocol (APP) or other protocol that is specifically designed to conform to REST principles.
The Web 2.0 server 330 can include a data store 332 in which applications 334, which can be speech-enabled, are stored. In one embodiment, the applications 334 can be written in a WIKI or other Web 2.0 syntax and can be stored in an APP format.
The contents of the application 334 can be accessed and modified using editor 350. The editor 350 can be a standard WIKI or other Web 2.0 editor having a voice plug-in or extensions 352. In one implementation, user-specific modifications made to the speech-enabled application 334 via the editor 350 can be stored in a customization data store as a customization profile and/or a state definition. The customization profile and state definition can contain customization settings that can override entries contained within the original application 334. Customizations can be related to a particular user or set of users.
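The override behavior of a customization profile can be sketched as a layered lookup in which user settings take precedence over the entries of the original application. The function name and the setting keys below are hypothetical.

```python
from collections import ChainMap


def effective_settings(application_entries, user_profile):
    """Merge settings so that user customizations override application entries.

    ChainMap searches its mappings left to right, so a key present in the
    user profile shadows the same key in the original application.
    Neither input mapping is modified.
    """
    return dict(ChainMap(user_profile, application_entries))
```

Because the original entries are left untouched, different users or sets of users can each hold their own profile against the same stored application.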
The transformer 340 can convert WIKI or other Web 2.0 syntax into standard markup for browsers. In one embodiment, the transformer 340 can be an extension of a conventional transformer that supports HTML and XML. The extended transformer 340 can be enhanced to handle JavaScript, such as AJAX. For example, resource links of application 334 can be converted into AJAX functions by the transformer 340 having an AJAX plug-in 342. The transformer 340 can also include a VoiceXML plug-in 344, which generates VoiceXML markup for voice-only clients.
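A transformation of resource links into browser markup might look like the following Python sketch. The "[[tts:...]]" link syntax, the generated attributes, and the client-side function name speechRequest are all hypothetical; the disclosure does not specify a concrete WIKI syntax or the form of the generated AJAX code.

```python
import html
import re

# Hypothetical WIKI syntax for a speech resource link, e.g. [[tts:Hello world]].
RESOURCE_LINK = re.compile(r"\[\[(tts|asr):([^\]]+)\]\]")

# speechRequest is an assumed client-side JavaScript function, not a real API.
TEMPLATE = (
    '<a href="#" data-kind="{kind}" data-arg="{arg}" '
    'onclick="speechRequest(this.dataset.kind, this.dataset.arg)">{arg}</a>'
)


def transform(wiki_text):
    """Replace each resource link with an anchor that triggers an AJAX call."""

    def to_markup(match):
        kind = match.group(1)
        arg = html.escape(match.group(2), quote=True)
        return TEMPLATE.format(kind=kind, arg=arg)

    return RESOURCE_LINK.sub(to_markup, wiki_text)
```

Text outside the recognized links passes through unchanged, so a plug-in of this kind can be layered on top of an existing WIKI-to-HTML conversion.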
The present invention may be realized in hardware, software, or a combination of hardware and software. The present invention may be realized in a centralized fashion in one computer system or in a distributed fashion where different elements are spread across several interconnected computer systems. Any kind of computer system or other apparatus adapted for carrying out the methods described herein is suited. A typical combination of hardware and software may be a general purpose computer system with a computer program that, when being loaded and executed, controls the computer system such that it carries out the methods described herein.
The present invention also may be embedded in a computer program product, which comprises all the features enabling the implementation of the methods described herein, and which when loaded in a computer system is able to carry out these methods. Computer program in the present context means any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: a) conversion to another language, code or notation; b) reproduction in a different material form.
This invention may be embodied in other forms without departing from the spirit or essential attributes thereof. Accordingly, reference should be made to the following claims, rather than to the foregoing specification, as indicating the scope of the invention.