US20060106613A1

Movatterモバイル変換

Info

Publication number: US20060106613A1
Application number: US11/319,989
Authority: US
Inventors: Scott Mills
Original assignee: SBC Technology Resources Inc
Current assignee: AT&T Labs Inc; Nuance Communications Inc
Priority date: 2002-03-26
Filing date: 2005-12-28
Publication date: 2006-05-18
Also published as: US20030187639A1; US7016842B2

Abstract

A method and system for evaluating telephone services provided by speech recognition interfaces an evaluation engine with a voice recognition service over a telephone system to submit speech utterance samples to the voice recognition service, receive the response of the voice recognition service to the sample utterances, and determine error and recognition of the sample utterances by the voice recognition service by comparing actual voice recognition service responses to expected responses. The evaluation engine permits evaluation of a voice recognition service for plural glossaries in different contexts, such as through predetermined nodes of a voice recognition service menu having plural glossaries.

Description

TECHNICAL FIELD OF THE INVENTION

The present invention relates generally to the evaluation of automated speech recognition, and more specifically relates to the evaluation of the effectiveness of automated speech recognition in providing a telephone service.

BACKGROUND OF THE INVENTION

Automatic speech recognition (ASR) technology interacts with human users by recognizing speech commands and responding with some action, such as providing users with information. ASR uses processor intensive evaluation of digitized voice signals to recognize human speech. For instance, ASR compares a digitized voice signal against a glossary, also known as a vocabulary, of expected responses and identifies the digitized voice signal as an expected response if a match is found with a great enough confidence. In order to improve the reliability of an ASR system, glossaries of expected responses are typically fine tuned to adapt as much as possible to variations in human voices and noise signals for a likely set of commands. ASR technology has steadily improved in terms of reliability and speed as processing capability and processing techniques have improved so that ASR technology is growing increasingly popular as a user friendly interface for businesses.

One application for ASR technology that is gaining wide acceptance is the use of voice recognition for providing services through a telephone network. Voice recognition offers a friendly alternative to touch tone services provided through DTMF signals and also reduces the cost otherwise associated with live operator support of customer inquiries. In particular, voice recognition based telephone services have grown increasingly popular in providing services through mobile devices such as wireless or cell phone networks because users are able to access information “hands off” making cell phone use safer, such as in driving conditions. As the quality of voice recognition applications has improved, an increasing number of services have become available ranging from obtaining driving directions, weather information, flight information and reservations and even stock quotes. For instance, Cingular wireless offers a variety of services supported by voice recognition through Cingular's VOICE CONNECT service.

When it works, voice recognition technology offers clear advantages for inputting requests to a telephone system compared with touch pad DTMF signaling and offers considerable cost advantages over the use of live operators. However, when voice recognition fails or performs unreliably, voice recognition introduces considerable user frustration. Thus, to improve reliability, voice recognition applications are typically tuned for a given set of expected commands and conditions. For instance, within a given service, separate glossaries of responses are often used to improve reliability by increasing the likelihood that a voice request will be recognized, with each glossary designed to address a set of commands. Further, glossaries are fine tuned periodically to adapt to changing conditions and respond to reliability problems. These fine tunings are in addition to changes implemented for menu items and additional services.

One significant difficulty with updating and improving the reliability of services supported by voice recognition is that changes and updates to voice recognition glossaries to support menu changes will have an effect on the service as a whole, for instance by altering recognition rates where glossaries are applied in different contexts. When voice recognition is deployed to a telephone service the overall impact of fine tuning of a glossary is difficult to predict for the application of the glossary in different contexts, such as in combination with other glossaries, especially when real live factors like noise and variations in voices are taken into account.

BRIEF DESCRIPTION OF THE DRAWINGS

A more complete understanding of the present embodiments and advantages thereof may be acquired by referring to the following description taken in conjunction with the accompanying drawing, in which like reference numbers indicate like features, and wherein:

The FIGURE depicts a block diagram of a system for evaluating automatic speech recognition services provided through a telephone network.

DETAILED DESCRIPTION OF THE INVENTION

Preferred embodiments of the present invention are illustrated in the figures, like numerals being used to refer to like and corresponding parts of the various drawings.

Voice recognition glossaries are typically designed and applied to optimize recognition of a set of expected commands, such as names of cities. However, in a voice recognition service having a variety of menu nodes, a number of different glossaries are typically used with each menu node. Thus, at any given node, a context having a wide variety of combinations of expected commands is possible so that voice utterances intended for recognition by one glossary may have an impact on recognition by other glossaries associated with a menu node. In other words, an accurate measure of the usability of a service that uses voice recognition is difficult to obtain from abstract testing of individual nodes or glossaries.

In order to evaluate voice recognition services using different combinations of one or more glossaries, the present invention uses sample utterances in different contexts to determine error and recognition rates. For instance, a sample utterance evaluates recognition of a voice command at different menu nodes of a voice recognition service so that evaluation of the response to the command is within a context of glossaries applied at that node.

Referring now to the FIGURE, a block diagram depicts a system for evaluating automatic speech recognition services provided by a telephone network. The system evaluates voice recognition services by interfacing with the service through the telephone network and submitting speech sample utterances and determining recognition and error rates for the voice recognition service. The evaluation is performed in either a manual or an automated mode by comparing expected responses to sample utterances against actual responses to identify errors and determine system reliability.

Anevaluation engine10 performs the evaluation of voice recognition services by sending selected sample utterances through the telephone network, receiving responses from the voice recognition service and determining error and recognition rates for the sample utterances. Aconfiguration engine12 interacts with a user to establish atest configuration14 and to provide the sample utterances oftest configuration14 to the voice recognition service through atelephone system interface16. Responses from the voice recognition service are received attelephone system interface16 and are provided to an error andrecognition assessment engine18. Error andrecognition assessment engine18 compares received results against expected results from the sample utterance sent according totest configuration14. Error and recognition rates are determinable either through user interaction of comparing recorded sample utterances and recorded voice recognition service responses or by automated comparisons that track sample utterances and voice recognition service responses by error and recognition occurrences.

Telephone system interface

16 communicates withtelephone network22 through aphysical interface20, such as a hybrid coupler phone tap. For instance,evaluation engine10 resides on a personal computer having a phone tapphysical interface20 that allowsevaluation engine10 to directly dial throughnetwork22 to communicate with avoice recognition service24. Direct communication betweenevaluation engine10 andvoice recognition service24 allows emulation of voice commands so thatevaluation engine10 is able to navigate through a voice recognition service menu either by following atest configuration14 or by manual manipulation through a user interface. Thus, for instance, if a problem is noted with a voice recognition service, a technician may manually navigate through the nodes of the menu with a variety of sample utterances to evaluate the extent of the difficulty or may design a test configuration that provides an automated navigation of the menu and reports error and recognition rates.

In one embodiment, in addition to the voice emulation interaction withvoice recognition service24, anevaluation engine module26 associated withvoice recognition service24 establishes a logical link withtelephone system interface16 to allow coordination withtest configuration14. For instance,evaluation engine module26 bringsvoice recognition service24 to a menu node corresponding to a menu node identified inconfiguration engine14 so that sample utterances are submitted for evaluation without having to follow the menu tree between nodes of the voice recognition service. Thus, as an example,telephone system interface16 may send sample utterances associated with one or more predetermined menu nodes in a repeated manner not bound by the menu of the voice recognition service withevaluation engine module26 bringingvoice recognition service24 to the predetermined node before each sample utterance is sent.

An evaluationgraphical user interface28 allows user interaction withevaluation engine10 to establish and runtest configurations14. Evaluationgraphical user interface28 is, for example, created with Visual Basic to operate on a Windows based personal computer, although other embodiments may use alternative programming applications and operating systems. Evaluationgraphical user interface28 applies amap30 ofvoice recognition service24 and alibrary32 of sample utterances, such as digitized voice samples stored as wave files having a “.wav” extension, to design atest configuration14 in atest configuration window34.Service buttons36 allow the design of atest configuration14 for a selectedvoice recognition service24 and allows establishment of basic contact information, such as the telephone number to dial for thevoice recognition service24 that is selected.Speaker buttons38 allow the selection of sample utterances classified by the speaker that generated the utterances.Noise buttons40 allow atest configuration14 to include simulated levels of noise such as static, road noise and/or crowd noise. Ago button42 initiates testing.

Map window

30 andlibrary window32

access configuration engine

12 to allow selection of atest configuration14 throughtest configuration engine34.Configuration engine12 presents a voice recognition service menu onmap window30 and a library of stored digital sample utterances inlibrary32 from a voice recognition servicemenu data base44 and sample utterancelibrary data base46.Menu data base44 includes a series of nodes corresponding to the menu items ofvoice recognition service24.

For instance, when a caller callsvoice recognition service24 the call is initially handled at a main menu node which provides generalized areas of inquiry that allow the user to select more specific information from children nodes of the main menu node. As an example, main menu node ofmenu data base44 provides a user with options to select children nodes including driving directions, weather, flight information, or stock quotes. The user selects an appropriate child node from the main menu by saying “go to driving directions”, “go to weather”, “go to flight information”, or “go to stock quotes”, as depicted by the utterances oflibrary data base46.

The selections available from the main menu node are often global selections that a user may state from any child node to proceed automatically to a selected child node or the parent main menu node. For instance, a user who selects flight information may automatically proceed to weather information from the flight information child node by stating the utterance of the main menu node “go to weather”. The child nodes of the main menu node in turn have child nodes that aid callers in determining specific information. For instance, the flight information child node allows a user to select an airline, destination and arrival city, as well as destination and arrival times. The weather child node allows a user to select a city. The driving directions child node allows a user to select a location, possibly a city or a landmark within a city. The stock quotes child node allows a user to select a company such as Southwestern Bell Corporation by either the company's name or ticker symbol, SBC.

Thevoice recognition service24 applies one or more glossaries at each node ofmenu data base44 to identify appropriate information for a caller. For instance, each node is tuned for voice recognition of expected requests of a caller to improve efficiency and reliability of the voice recognition service. One difficulty with the use of different glossaries is that one or more utterances may overlap between different nodes of the menu leading to reduced service reliability. For instance, a node that relates to stock quotes may fail to recognize global glossary utterances due to the relationship between the utterances for stock quotes available through the service and the utterances associated with a global menu node, such as the main menu node. In such a situation, a caller at the stock quotes node who commands “go to main menu” instead could receive an unrequested stock quote, resulting in caller frustration and an inability to proceed to the main menu.

Test configuration window

34 provides a drop and drag environment for creating atest configuration14 by selecting nodes frommap window30 and sample utterances for the node fromlibrary window32. In the most simple example, a user contactsvoice recognition service24 throughtelephone system interface16 and manually selecting sample utterances fromlibrary window32 based on a speaker selected frombutton38. For instance, oncetelephone system interface16 establishes contact with voice recognition service24 a user selects “go to driving directions” fromlibrary window32 stated by a speaker selected bybutton38. In this manner, the user may navigate the menu ofvoice recognition service24 as a normal caller but with sample utterances and simulated noise conditions. The error or recognition results of the response are tracked by error andrecognition engine18 which provides an automated comparison to expected voice recognition service responses, records responses for future comparison or tabulates error or recognition results based upon a manual determination made by the user.

In an alternative embodiment,test configuration window34 automates atest configuration14 forevaluation engine10 to run in cooperation withvoice recognition service24. For example, the test configuration depicted inwindow34 of the FIGURE illustrates navigation through four voice recognition service nodes with selected sample utterances at each node.Evaluation engine10 automates interaction withvoice recognition service24 according totest configuration14 as designed intest configuration window34 so that, for instance, a desired test configuration may be repeatedly run with different speaker and noise conditions. Error andrecognition assessment engine18 tracks responses tovoice recognition service24 and tabulates results based on a comparison of actual and expected responses byvoice recognition service24 to sample utterances.Evaluation engine module26 automates the navigation ofvoice recognition service24 to enable a more rapid navigation through nodes to be tested by avoiding the need to navigate by voice commands.

One advantage ofevaluation engine10 is thattest configurations14 allows the testing of speech recognition and error rates based on context. For example, the error and recognition rates associated with a particular glossary or glossaries are tested within the context of the voice recognition service. As glossaries are updated and tuned for a node or nodes of a voice recognition menu, the impact of such updates or tuning is tested so that the response of a voice recognition service in different contexts is determined. For instance, the addition of a new stock for stock quotes to a voice recognition service glossary may have unintended impacts on a global glossary such as the main menu so that a caller at the stock quotes node who states “go to main menu” has a greater likelihood of voice recognition error in the stock quote context than in the main menu context. Indeed, as voice recognition service menus grow more complicated, it becomes more difficult to design glossaries for a particular context so that the glossaries take into account the myriad of other menu items that may be available to callers of a voice recognition service in the context of that node.

One example of an advantage ofevaluation engine10 is that it provides a practical testing tool that identifies potential problems with a voice recognition service in the actual context of the service as opposed to separate testing of the glossaries. Thus, as services are updated to include additional nodes, changes to nodes or fine-tuning of glossaries, test configurations run byevaluation engine10 allow a determination of the effect of changes in the actual context of the voice recognition service. By identifying potential recognition errors in the context of the voice recognition service,evaluation engine10 provides a basis for improving node and glossary design for a voice recognition service as a whole.

Another example of an advantage ofevaluation engine10 is that it provides a user-friendly testing platform to evaluate the effectiveness of a voice recognition service provided through a telephone network. For instance, complaints by telephone network users about particular menu node or voice command failures may be tested through a simulated interaction that emulates the conditions of the reported failure. Automated test configurations with sample utterances from a range of speakers and conditions allows the pinpointing of problem areas to provide specific areas for improvement, thus reducing the cost and improving the results of future updates.

Another example of an advantage is thatevaluation engine10 is flexible to adapt to a variety of services, including services provided by different vendors. For instance, becauseevaluation engine10 interfaces with services through a telephone network, it provides a base testing platform for comparing services provided by different vendors by initiating interaction with each service as a customer.

Although the present invention has been described in detail, it should be understood that various changes, substitutions and alterations can be made hereto without the parting from the spirit and scope of the invention as defined by the appended claims.

Claims

1-20. (canceled)

21. A system for evaluating a voice recognition service, comprising:

an evaluation graphical user interface (GUI) operable to facilitate the generation of a test configuration for evaluating a voice recognition service, the evaluation GUI including:

a node selection interface allowing the user to select from a plurality of voice recognition menu nodes a particular node to be evaluated; and

a utterance selection interface allowing the user to select one or more sample utterances for the evaluation of the voice recognition service; and

an evaluation engine communicatively coupled to the evaluation GUI and configured to cooperate with the evaluation GUI to generate the test configuration, the evaluation engine including:

a voice recognition service interface for communicating to the voice recognition service the one or more selected sample utterances and for receiving from the voice recognition service one or more actual responses to the one or more selected sample utterances; and

an assessment engine for comparing the one or more actual responses to one or more expected responses to determine recognition rates of the voice recognition service.

22. A system according toclaim 21, wherein:

the node selection interface comprises a map interface configured to display a map of the plurality of voice recognition menu nodes indicating relationships between particular nodes; and

the utterance selection interface comprises a library interface configured to display a library of sample utterances from which the user can select the one or more sample utterances.

23. A system according toclaim 21, wherein the node selection interface comprises a map interface configured to display a map of the plurality of voice recognition menu nodes indicating the relationships between particular nodes and to allow the user to use the map to select the particular node to be evaluated.

24. A system according toclaim 21, wherein the evaluation GUI further includes a speaker selection interface allowing the user to select a particular speaker from a plurality of different speakers; and

wherein the utterance selection interface comprises a library interface configured to display a library of sample utterances generated by the selected particular speaker and to allow the user to select the one or more sample utterances from the displayed library.

25. A system according toclaim 21, wherein the evaluation GUI further includes a noise selection interface allowing the user to select one or more simulated background noises.

26. A system according toclaim 25, wherein the one or more simulated background noises comprise at least one of static noise, road noise, and crowd noise.

27. A system according toclaim 21, further comprising a telephony interface for communicating the one or more selected sample utterances to the voice recognition service.

28. A method for evaluating a voice recognition service, comprising:

receiving a user selection of a particular voice recognition menu node to be evaluated, the particular node selected from a plurality of voice recognition menu nodes;

receiving a user selection of one or more sample utterances for an evaluation of the particular node;

communicating the one or more selected sample utterances to a voice recognition service;

receiving one or more actual responses to the one or more selected sample utterances from the voice recognition service; and

comparing the one or more actual responses to one or more expected responses to determine recognition rates of the voice recognition service.

29. A method according toclaim 28, wherein the user selection of the particular node and the one or more sample utterances are provided by an evaluation graphical user interface.

30. A method according toclaim 28, further comprising:

displaying a map of the plurality of voice recognition menu nodes indicating relationships between particular nodes; and

displaying a library of sample utterances from which the user can select the one or more sample utterances.

31. A method according toclaim 28, further comprising:

generating an evaluation graphical user interface (GUI) that displays a map of the plurality of voice recognition menu nodes indicating the relationships between particular nodes; and

receiving the user selection of the particular node to be evaluated via the evaluation GUI.

32. A method according toclaim 28, further comprising generating an evaluation graphical user interface (GUI), the evaluation GUI:

allowing a user to select a particular speaker from a plurality of different speakers;

displaying a library of sample utterances generated by the selected particular speaker; and

allowing the user to select from the displayed library the one or more sample utterances.

33. A method according toclaim 28, further comprising receiving a user selection of one or more simulated background noises.

34. A method according toclaim 33, wherein the one or more simulated background noises comprise at least one of static noise, road noise, and crowd noise.

35. A method according toclaim 28, wherein the one or more selected sample utterances are communicated to the voice recognition service through a telephony interface.

36. Computer instructions encoded in computer-readable media and executable by a processor, comprising:

logic for receiving a user selection of a particular voice recognition menu node to be evaluated, the particular node selected from a plurality of voice recognition menu nodes;

logic for receiving a user selection of one or more sample utterances for an evaluation of the particular node;

logic for communicating to a voice recognition service the one or more selected sample utterances;

logic for receiving from the voice recognition service one or more actual responses to the one or more selected sample utterances; and

logic for comparing the one or more actual responses to one or more expected responses to determine recognition rates of the voice recognition service.

37. Computer logic according toclaim 36, further comprising:

logic for displaying a map of the plurality of voice recognition menu nodes indicating relationships between particular nodes; and

logic for displaying a library of sample utterances from which the user can select the one or more sample utterances.

38. Computer logic according toclaim 36, further comprising:

logic for generating an evaluation graphical user interface (GUI) that displays a map of the plurality of voice recognition menu nodes indicating the relationships between particular nodes; and

logic for receiving the user selection of the particular node to be evaluated via the evaluation GUI.

39. Computer logic according toclaim 36, further comprising logic for generating an evaluation graphical user interface (GUI) including:

logic for allowing the user to select a particular speaker from a plurality of different speakers;

logic for displaying a library of sample utterances generated by the selected particular speaker; and

logic for allowing the user to select the one or more sample utterances from the displayed library.

40. Computer logic according toclaim 36, further comprising logic for communicating the one or more selected sample utterances to the voice recognition service through a telephony interface.