US7076430B1 - System and method of providing conversational visual prosody for talking heads


Publication number
US7076430B1
Authority
US
United States
Prior art keywords
virtual agent
speech
movement
prosody
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime, expires
Application number
US10/173,341
Inventor
Eric Cosatto
Hans Peter Graf
Thomas M. Isaacson
Volker Franz Strom
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
AT&T Properties LLC
Original Assignee
AT&T Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to US10/173,341 (patent US7076430B1)
Assigned to AT&T CORP.: assignment of assignors interest (see document for details). Assignors: ISAACSON, THOMAS M., STROM, VOLKER FRANZ, COSATTO, ERIC, GRAF, HANS PETER
Application filed by AT&T Corp
Priority to US11/237,561 (patent US7349852B2)
Priority to US11/237,557 (patent US7353177B2)
Publication of US7076430B1
Application granted
Priority to US12/020,049 (patent US7844467B1)
Priority to US12/019,974 (patent US8200493B1)
Assigned to AT&T INTELLECTUAL PROPERTY II, L.P.: assignment of assignors interest (see document for details). Assignors: AT&T PROPERTIES, LLC
Assigned to AT&T PROPERTIES, LLC: assignment of assignors interest (see document for details). Assignors: AT&T CORP.
Assigned to NUANCE COMMUNICATIONS, INC.: assignment of assignors interest (see document for details). Assignors: AT&T INTELLECTUAL PROPERTY II, L.P.
Assigned to MICROSOFT TECHNOLOGY LICENSING, LLC: assignment of assignors interest (see document for details). Assignors: NUANCE COMMUNICATIONS, INC.
Adjusted expiration
Expired - Lifetime


Abstract

A system and method of controlling the movement of a virtual agent while the agent is listening to a human user during a conversation is disclosed. The method comprises receiving speech data from the user, performing a prosodic analysis of the speech data and controlling the virtual agent movement according to the prosodic analysis.

Description

PRIORITY APPLICATION
The present application claims priority to Provisional Patent Application No. 60/380,952 filed May 16, 2002, the contents of which are incorporated herein by reference.
RELATED APPLICATION
The present application is related to U.S. application Ser. No. 10/173,184, filed on Jun. 17, 2002, the contents of which are incorporated herein by reference.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to controlling animations and more specifically to a system and method of providing reactive behavior to virtual agents when a human/computer interaction is taking place.
2. Discussion of Related Art
Much work has recently been focused on generating visual Text-to-Speech interactions between a human user and a computer device. The natural interaction between a computer and human is increasing as conversational agents or virtual agents improve, but the widespread acceptance and use of virtual agents is hindered by unnatural interactions with the virtual agent. Studies show that a customer's impression of a company's quality is heavily influenced by the customer's experience with the company. Brand management and customer relations management (CRM) drive much of a company's focus on its interaction with the customer. When a virtual agent is not pleasing to interact with, a customer will have a negative impression of the company as represented by the virtual agent.
Movements of the head of a virtual agent must be natural or viewers will dislike the virtual agent. If the head movement is random, the impression to the human user is that the virtual agent is synthetic rather than real. In some cases, the head appears to float over a background. This approach is judged by many viewers to be “eerie.”
One can try to interpret the meaning of the text with a natural language understanding tool and then derive some behavior from that. Yet, such an approach is usually not feasible, since natural language understanding is very unreliable. A wrong interpretation can do considerable harm to the animation. For example, if the face is smiling while articulating a sad or tragic message, the speaker comes across as cynical or mean spirited. Most viewers dislike such animations and may become upset.
An alternative approach is to use ‘canned’ animation patterns. This means that a few head motion patterns are stored and repeatedly applied. This can work for a short while, yet the repetitive nature of such animations soon annoys viewers.
Yet another approach is to provide recorded head movements for the virtual agent. While this improves the natural look of the virtual agent, unless those head movements are synchronized to the text being spoken, to the viewer the movements become unnatural and random.
Movement of the head of a virtual agent is occasionally mentioned in the literature but few details are given. See, e.g., Cassell, J., Sullivan, J., Prevost, S., Churchill, E. (eds.), “Embodied Conversational Agents”, MIT Press, Cambridge, 2000; Hadar, U., Steiner, T. J., Grant, E. C., Rose, F. C., “The timing of shifts in head postures during conversation”, Human Movement Science, 3, pp. 237–245, 1984; and Parke, F. I., Waters, K., “Computer Facial Animation”, A. K. Peters, Wellesley, Mass., 1997.
Some have studied emotional expressions of faces and also describe non-emotional facial movements that mark syntactic elements of sentences, in particular sentence endings. But the emphasis is on head movements that are semantically driven, such as nods indicating agreement. See, e.g., Ekman, P., Friesen, W. V., “Manual for the Facial Action Coding System”, Consulting Psychologists Press, Palo Alto, 1978.
Conventionally, animations in virtual agents are controlled through interpretation of the text generated from a spoken dialog system that is used by a Text-to-Speech (TTS) module to generate the synthetic voice to carry on a conversation with a user. The system interprets the text and manually adds movements and expressions to the virtual agent.
Yet another attempt at providing virtual agent movement is illustrated by the FaceXpress development product available for virtual agents and offered through LifeFX®. FaceXpress is a tool that enables a developer to control the expression of the virtual agent. FIG. 1 illustrates the use of the tool 10. In this web-based version of the virtual agent development tool, the developer of the virtual agent organizes preprogrammed gestures, emotions and moods. Column 12 illustrates the selected dialog 14, gestures 16 and other selectable features such as punctuators 32, actions 34, attitudes 36 and moods 38. Column 18 illustrates the selectable features. Shown is column 18 when the gestures option is selected to disclose the available pre-programmed gestures smile 20, frown 40 and kiss 42. The developer drags the desired gesture from column 18 to column 22. Column 22 shows the waveform of the text 24, a timing ruler 44, the text spoken by the virtual agent 26 and rows for the various features of the agent, such as the smile 28. A moveable amplitude button 46 enables the developer to adjust the parameters of the smile feature. While this process enables the developer to control the features of a virtual agent, it is a time-consuming and costly process. Further, the process clearly will not enable a real-time conversation with a virtual agent where every facial movement must be generated live. With the increased capability of synthetic speech dialog systems being developed using advanced dialog management techniques that remove the necessity for preprogrammed virtual agent sentences, the opportunity to pre-program virtual agent movement will increasingly disappear.
The process of manually adding movements to the virtual agent is a slow and cumbersome process. Further, quicker systems do not provide a realistic visual movement that is acceptable to the user. The traditional methods of controlling virtual agent movement preclude the opportunity of engaging in a realistic interaction between a user and a virtual agent.
SUMMARY OF THE INVENTION
What is needed in the art is a new method of controlling head movement in a virtual agent such that the agent's movement is more natural and real. There are two parts to this process: (1) controlling the head movements of the virtual agent when the agent is talking, and (2) controlling the head movements when the virtual agent is listening. The related patent application referenced above relates to item (1). This disclosure primarily relates to item (2) and how to control head movements when the virtual agent is listening to a human speaker.
The present invention utilizes prosody to control the movement of the head of a virtual agent in a conversation with a human. Prosody relates to speech elements such as pauses, tone, pitch, timing effects and loudness. Using these elements to control the movement of the virtual agent head enables a more natural-appearing interaction with the user.
One embodiment of the invention relates to a method of controlling the virtual agent that is listening to a user. The method comprises receiving speech data from the user, performing a prosodic analysis of the speech data, and controlling the virtual agent movement according to the prosodic analysis.
Other embodiments of the invention include a system or apparatus for controlling the virtual agent movement while listening to a user and a computer-readable medium storing a set of instructions for operating a computer device to control the head movements of a virtual agent when listening to a user.
The present invention enables animating head movements of virtual agents that are more convincing when a human is having an interactive conversation. When the facial expressions and head movements of the virtual agent respond essentially simultaneously to the speech of the user, the agent appears more like a human itself. This is important for producing convincing agents that can be used in customer service applications, e.g., for automating call centers with web-based user interfaces.
Being able to control the facial expressions and head movements automatically, without having to interpret the text or the situation, opens for the first time the possibility of creating photo-realistic animations automatically. For applications such as customer service, the visual impression of the animation has to be of high quality in order to please the customer. Many companies have tried to use visual text-to-speech in such applications but failed because the quality was not sufficient.
BRIEF DESCRIPTION OF THE DRAWINGS
The foregoing advantages of the present invention will be apparent from the following detailed description of several embodiments of the invention with reference to the corresponding accompanying drawings, in which:
FIG. 1 illustrates a prior art method of generating gestures in a virtual agent;
FIG. 2 illustrates ToBI symbols used in marking pitch accents and phrase boundaries in prosody;
FIG. 3 illustrates an exemplary system for controlling movement of a virtual agent during a conversation with a user;
FIG. 4A illustrates an exemplary client/server-based virtual agent model over a network;
FIG. 4B illustrates another aspect of the client/server-based virtual agent model over a network;
FIG. 5 illustrates tables of pitch accents and phrase boundaries for a data set;
FIG. 6 illustrates the phonetic transcript with annotations of a sentence;
FIG. 7 illustrates an example of determining precise head movements for a virtual agent;
FIG. 8 shows the feature points on a virtual agent;
FIG. 9 illustrates a coordinate system used when providing movement to a virtual agent;
FIG. 10A shows an example of the head angle ax as a function of time;
FIG. 10B illustrates a high-pass filtered part of a signal in FIG. 10A;
FIG. 10C shows the same sentence spoken as in FIG. 10B, but with the instruction to talk with a cheerful expression; and
FIG. 11 illustrates a method of changing databases of virtual agent movements according to culture or language.
DETAILED DESCRIPTION OF THE INVENTION
In human-to-human conversations, many visual cues aid in the exchange of information and whose turn it is to speak. For example, when a first person is speaking and is about to finish, his or her facial expression and head movement provide visual cues that are recognized by the listener as indicating that it is the listener's turn to talk. Further, while the first person is speaking, the listener will often exhibit head motions such as nodding or smiling at the appropriate time to indicate that the words or meanings provided by the first person are heard and understood.
These kinds of very natural movements for humans that assist in an efficient and natural conversation are difficult to replicate when a human is speaking to a virtual animated entity or virtual agent.
Head movements, facial expressions, and gestures, applied by the speaker for underlining the meaning of the text, usually accompany speech. Such movements aid the understanding of the spoken text, but they also convey a lot of additional information about the speaker, such as the emotional state or the speaker's temper.
Nonverbal components in face-to-face communication have been studied extensively, mostly by psychologists. Such studies typically link head and facial movements or gestures qualitatively to parts of the text. Many of the more prominent movements are clearly related to the content of spoken text or to the situation at hand. For example, much of the body language in conversations is used to facilitate turn taking. Other movements are applied to emphasize a point of view. Some movements serve basic biological needs, such as blinking to wet the eyes. Moreover, people always tend to move slightly to relax some muscles while others contract. Being completely still is unnatural for humans and requires considerable concentration.
Besides movements that are obviously related to the meaning of the text, many facial expressions and head shifts are tied more to the text's syntactic and prosodic structure. For example, a stress on a word is often accompanied by a nod of the head. A rising voice at the end of a phrase may be underlined with a rise of the head, possibly combined with rising eyebrows. These are the types of facial and head movements used according to the present invention. Since they are analogous to prosody in speech analysis, these kinds of facial and head movements are called “visual prosody.”
Speech prosody involves a complex array of phonetic parameters that express attitude, assumptions, and attention and can be represented as a parallel channel of information to the meaning of the speech. As such, prosody supports a listener's recovery of the basic message contained in speech as well as the speaker's attitude toward the message and the listener as well.
Little information exists about prosodic movements, and no quantitative results have been published that show how such head and facial movements correlate with elements of speech.
The object of the present invention is to synthesize natural-looking virtual agents and especially to synthesize a listening virtual agent as well as a virtual agent transitioning from listening to talking or from talking to listening. The virtual agent may be a person or any kind of entity such as an animal or other object like a robot.
Many of the classical animation techniques have only limited applicability for the types of talking heads described here. Artists have long been able to express emotions and personality in cartoon characters with just a few strokes of a pen. See, e.g., Culhane, S., Animation: From Script to Screen, St. Martin's Press, New York, 1988.
However, as talking heads look more and more like real, recorded humans, viewers become more critical of small deviations from what is considered natural. For example, a cartoon character needs only very rough lip-sound synchronization to be perceived as pleasant. A photo-realistic head, on the other hand, has to show perfect synchronization. Otherwise the depicted person may seem to have a speech disability, which may be embarrassing to a viewer.
Emulating human behavior perfectly requires an understanding of the content of the text. Since many of the movements are closely coupled to prosodic elements of the text, the present invention relates to deriving natural-looking head movements using just the prosodic information.
Prosody describes the way speech is intonated with such elements as, but not limited to, pauses, the duration of the pauses, pitch, timing effects, and loudness. The details of the intonation are influenced by the personality of the speaker, by the speaker's emotional state, as well as by the content of the text. Yet, underneath personal variations lie well-defined rules that govern the intonation of a language. See, e.g., Huang, X., Acero, A., Hon, H., Spoken Language Processing, Prentice Hall, 2001, pp. 739–791, incorporated herein. Predicting the prosody from text is one of the major tasks for text-to-speech synthesizers. Therefore, fairly reliable tools exist for this task.
The text according to the present invention can be recorded or received from a number of sources. For example, text can be recorded from (1) short sentences plus greetings, (2) sentences designed to cover all diphones in English, (3) short children's stories, and (4) paragraphs from sources such as the Wall Street Journal. As can be understood, text can come from any source.
In order to practice the present invention, a database may need to be developed. For example, a database containing about 1,075 sentences was compiled from recordings of six different speakers. Five speakers talked for about 15 minutes each, pronouncing text from the first two sources. The sixth speaker is recorded for over two hours, articulating the whole data set. In this latter case the speaker is also instructed to speak some of the text while expressing a number of different emotions.
A prosodic prediction tool identifies prosodic phrase boundaries and pitch accents on the whole database, i.e., labels the expected prosody. These events are labeled according to the ToBI (Tones and Break Indices) prosody classification scheme. For more information on the ToBI method, see, e.g., M. Beckman, J. Hirschberg, The ToBI Annotation Conventions, and K. Silverman, M. Beckman, J. Pitrelli, M. Ostendorf, C. Wightman, P. Price, J. Pierrehumbert, J. Hirschberg, “ToBI: A Standard for Labeling English Prosody”, Int. Conf. on Spoken Language Processing, 1992, Banff, Canada, pp. 867–870.
ToBI labels not only denote accents and boundaries, but also associate them with a symbolic description of the pitch movement in their vicinity. The symbols indicate whether the fundamental frequency (F0) is rising or falling. The two tone levels, high (H) and low (L), describe the pitch relative to the local pitch range.
FIG. 2 illustrates a first table having a symbol of pitch accent column 60 and a corresponding column 62 for the movement of the pitch of the fundamental frequency. For example, H* is a symbol of the pitch accent having a corresponding movement of the pitch of high-to-upper end of the pitch range. Similarly, L* is a pitch accent symbol indicating a low-to-lower end of a pitch range.
The second table in FIG. 2 illustrates a phrase boundary column 64 and a movement of the fundamental frequency column 66. This table correlates a phrase boundary, such as H-H%, to a movement of pitch that is high and rising higher toward the end, typical for a yes-no question. From correlations such as this, the ToBI symbols for marking pitch accents and phrase boundaries can be derived. These pitch accents and phrase boundaries can be used to control the movement of the virtual agent while listening and speaking to a user.
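As a rough illustration of how such labels might be held in software, the following Python sketch stores only the three symbols named in the text (H*, L* and H-H%) together with the pitch movement the text gives for each. The dictionary names and the `describe` helper are hypothetical and are not part of the disclosure; the remaining rows of FIG. 2 are not reproduced here.

```python
# Hypothetical lookup tables holding only the ToBI symbols named in the text.
PITCH_ACCENTS = {
    "H*": "high, toward the upper end of the pitch range",
    "L*": "low, toward the lower end of the pitch range",
}
PHRASE_BOUNDARIES = {
    "H-H%": "high and rising higher toward the end (typical for a yes-no question)",
}


def describe(label):
    """Return (kind, F0 movement) for a ToBI label, or None if it is not listed."""
    if label in PITCH_ACCENTS:
        return ("pitch accent", PITCH_ACCENTS[label])
    if label in PHRASE_BOUNDARIES:
        return ("phrase boundary", PHRASE_BOUNDARIES[label])
    return None
```

Keeping accents and boundaries in separate tables mirrors the two tables of FIG. 2 and lets a caller tell the two kinds of event apart.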
A conversation with a talking head will appear natural only if not just the speaking, or active, portion of the conversation is animated carefully with appropriate facial and head movements, but also the passive, or listening, part. Tests with talking heads indicate that one of the most frequent complaints relates to appropriate listening behavior.
The present invention addresses the issue of how to control the movement and expression on the face of an animated agent while listening. The invention solves the problem by controlling facial and head movements through prosodic and syntactic elements in the text entered by a user, i.e., the text that the talking head is supposed to listen to and ‘understand’. Adding listening visual prosody, i.e., proper facial and head movements while listening, makes the talking head seem to comprehend the human partner's input.
FIG. 3 illustrates a system 100 or apparatus for controlling a conversation between a user and a virtual agent, including the movement of the virtual agent while listening. As is known in the art, the various modules of the embodiments of the invention may operate on computer devices such as a personal computer, handheld wireless device, a client/server network configuration or other computer network. The particular configuration of the computer device or network, whether wireless or not, is immaterial to the present invention. The various modules and functions of the present invention may be implemented in a number of different ways.
Text or speech data, received from a source 102, may be words spoken by a user in a conversation with a virtual agent or may come from any other source. From this speech data, a prosodic analysis module 104 performs a syntactic analysis to determine and extract prosodic and syntactic patterns within the speech data.
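The text does not specify how the prosodic analysis is computed. As a minimal sketch only, two of the prosodic elements named earlier (loudness and pauses) could be estimated from raw audio samples as below. The function names, the fixed frame length and the silence threshold are all assumptions made for illustration, and pitch extraction is omitted.

```python
import math


def frame_rms(samples, frame_len):
    """Root-mean-square loudness of each full frame of `samples`."""
    starts = range(0, len(samples) - frame_len + 1, frame_len)
    return [
        math.sqrt(sum(x * x for x in samples[s:s + frame_len]) / frame_len)
        for s in starts
    ]


def find_pauses(rms, silence_threshold, min_frames):
    """Return (start, end) frame spans, end exclusive, of sufficiently long quiet runs."""
    pauses, start = [], None
    # The infinite sentinel closes a pause that runs to the end of the input.
    for i, level in enumerate(rms + [float("inf")]):
        if level < silence_threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_frames:
                pauses.append((start, i))
            start = None
    return pauses
```

For example, `find_pauses(frame_rms(samples, 160), 0.01, 20)` would report quiet stretches of at least 20 frames; the specific numbers are placeholders, not values from the disclosure.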
After the prosodic structure of an utterance has been determined, this information has to be translated into movements of the head and facial parts. One way is to store many prosodic patterns and their corresponding head movements in a database. In a speaking, listening or transition mode for the virtual agent, when the system prepares to synthesize a sentence to be spoken by the virtual agent, the system searches for a sample in the database with similar prosodic events and selects the corresponding head and facial movements. This produces very natural-looking animations. This aspect of the invention does, however, require that a large number of patterns be stored and that the whole database be searched for every new animation.
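The sample-based lookup described above can be sketched as a nearest-match search over stored prosodic event sequences. This is an illustration under stated assumptions, not the patented implementation: the database contents, the movement names and the use of `difflib.SequenceMatcher` as the similarity measure are all hypothetical choices.

```python
from difflib import SequenceMatcher

# Hypothetical database entries: (prosodic event sequence, stored movement name).
DATABASE = [
    (["H*", "L-L%"], "nod_then_settle"),
    (["H*", "H*", "H-H%"], "nod_nod_raise_head"),
    (["L*", "H-H%"], "tilt_then_raise_head"),
]


def select_movement(events, database=DATABASE):
    """Scan every entry and return the movement whose events best match `events`."""
    def similarity(entry):
        return SequenceMatcher(None, entry[0], events).ratio()

    return max(database, key=similarity)[1]
```

The linear scan over every entry is deliberate: it makes visible the cost noted in the text, namely that the whole database is searched for each new animation.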
The precise form of the head and facial movements is not critical and varies widely from person to person. What matters is that the movements are synchronized with the prosodic events. Such movements can be generated with a rule-based model or a finite state machine. For this approach, the inventors analyzed recorded sequences and determined the probability of particular head movements for each of the main prosodic events. For example, the system looks at how often a nod occurs on a stress, or a raised eyebrow at the end of a question. Using such a model, the system calculates for each prosodic event in the sequence the likelihood that a prominent head movement occurs. This model or rule-based approach can produce natural-looking sequences if enough samples are analyzed to determine all the probabilities properly. It has the advantage that it is computationally less costly than a sample-based approach.
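A minimal sketch of the probability-based approach, assuming the recorded sequences have already been reduced to pairs of (prosodic event, whether a prominent movement occurred). Both function names and the input format are hypothetical; the text does not give the actual model.

```python
import random
from collections import Counter


def estimate_probabilities(observations):
    """Estimate P(movement | event) from (prosodic_event, movement_occurred) pairs."""
    totals, hits = Counter(), Counter()
    for event, moved in observations:
        totals[event] += 1
        if moved:
            hits[event] += 1
    return {event: hits[event] / totals[event] for event in totals}


def plan_movements(events, probabilities, rng=random):
    """For each event, draw whether a prominent head movement is triggered."""
    # Events never seen in the observations get probability 0.0 (no movement).
    return [rng.random() < probabilities.get(event, 0.0) for event in events]
```

Passing a seeded `random.Random` instance as `rng` makes the drawn sequence repeatable, which may help when comparing animations.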
Returning to FIG. 3, the prosodic data is transmitted to a selection module 106 that selects associated or matching prosody or syntactic patterns from a listening database 108. The listening database stores prosodic and syntactic patterns, as well as behavior patterns, that are appropriate for listening activity according to convention. Once the selection module 106 selects the behavior patterns, the patterns are transmitted to the virtual agent-rendering device as listening and behavioral face and head movements 110.
As an example, if a user were to elevate his voice in a conversation with another human, the listening person may naturally pull back and raise his eyebrows at the outburst. Similarly, the listening database will store such behavioral patterns for appropriate responses to the detected prosody in the speech data directed at the virtual agent.
Once the selection module 106 selects the behavioral patterns, the data is transmitted to the virtual agent in real-time to control the listening behavior, i.e., facial and head movements, of the virtual agent. For example, suppose a person is talking in a monotone voice for several minutes to the virtual agent. The behavior of the virtual agent according to the prosody of the monotone speech will be appropriate for such language. This may be, for example, minor movements and eye blinking. However, if the person suddenly yells at the virtual agent, the sudden change in the prosody of the speech would be immediately processed, the listening behavior pattern of the virtual agent would shift, and the virtual agent would exhibit a surprised or eyebrow-raising expression in response to the outburst.
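The monotone-then-outburst example can be sketched as a comparison of each loudness frame against a short running baseline. The window size, the jump ratio and the behavior labels below are assumptions chosen for illustration; the disclosure does not state how the change in prosody is detected.

```python
def listening_behavior(loudness, window=5, jump_ratio=3.0):
    """Return one behavior label per loudness frame.

    A frame counts as an outburst when it exceeds `jump_ratio` times the
    mean of the preceding `window` frames.
    """
    behaviors = []
    for i, level in enumerate(loudness):
        recent = loudness[max(0, i - window):i]
        # With no history yet, the frame is compared against itself.
        baseline = sum(recent) / len(recent) if recent else level
        if baseline > 0 and level > jump_ratio * baseline:
            behaviors.append("raise_eyebrows")
        else:
            behaviors.append("minor_movement")
    return behaviors
```

A steady input yields only `minor_movement`, while a single frame several times louder than its recent history yields `raise_eyebrows` for that frame.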
In addition to listening and speaking behavior, there are also transition periods, which must be animated appropriately. When the input text stops, the behavior has to switch to a ‘planning’ or ‘preparation’ stage, where it is clear that the head is getting ready for a response. Control moves from the listening selection module 106 to the transition selection module 112.
The transition selection module 112 controls the exchange of data between the transition database 114 and the modules that control the movement of the virtual agent 116. The selection module 112 matches prosody and syntactic patterns drawn from a transition database 114. The transition database stores behavior patterns appropriate for transition behaviors. For example, when one speaker is done, certain movements of the head or behaviors will indicate the other person's turn to talk. If one person continues to talk and the other wants to speak, a certain transition visual behavior will indicate a desire to cut in and talk. Once the transition pattern is selected, the selection module 112 transmits the data to the virtual agent for controlling the behavioral facial and head movements in real-time for a more natural experience for the user.
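The hand-off between listening, transition and speaking can be pictured as a small state machine. The sketch below is one possible reading, not the disclosed design: the `Mode` enum, the event names and the transition table are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    LISTENING = "listening"
    TRANSITION = "transition"
    SPEAKING = "speaking"


# Hypothetical event names; (current mode, event) -> next mode.
TRANSITIONS = {
    (Mode.LISTENING, "user_input_stopped"): Mode.TRANSITION,
    (Mode.TRANSITION, "response_ready"): Mode.SPEAKING,
    (Mode.SPEAKING, "response_finished"): Mode.TRANSITION,
    (Mode.TRANSITION, "user_started_talking"): Mode.LISTENING,
}


def next_mode(mode, event):
    """Return the next mode; unknown events leave the mode unchanged."""
    return TRANSITIONS.get((mode, event), mode)
```

In such a design each mode would select patterns from its own database (listening, transition or speaking), so the current mode decides which selection module is active.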
Modules 110, 116 and 122 may represent rendering modules for directly controlling the rendering of the virtual agent, or they may act as modules for providing data to other rendering modules located on server or client devices for controlling the movements of the virtual agent.
Once a transition is complete, the conversation proceeds to the virtual agent's turn to talk. Here, the selection module 118 will receive a phoneme string having prosodic and syntactic patterns 124 from a text generation module 126. Those of skill in the art are aware of means for text generation in the context of a spoken dialog service.
The speaking movement selection module 118 uses the prosodic and syntactic patterns from the generated text to select from a speaking database 120 the appropriate prosodic and syntactic behavioral patterns to control the speaking behavior and facial movements of the virtual agent 122. At the end of text output from the virtual agent, the virtual agent should signal to the viewer that now it is his or her turn to speak.
The speaking database may comprise, for example, an audio-visual database of recorded speech. The database may be organized in a variety of ways. The database may include segments of audio-visual speech of a person reading text that includes an audio and video component. Thus, during times of listening, speaking or transition, the system searches the speaking database and selects segments of matching visual prosody patterns from the database, and the system controls the virtual agent movements according to the movements of the person recorded in the selected audio-visual recorded speech segments. In this manner, the user can experience a more realistic and natural movement of the virtual agent.
In another aspect of the invention, the speaking behavior database is not utilized and a model is used for determining virtual agent movements according to speech data. The model may be automatically trained or be handcrafted. Using the model approach, however, provides a different means of determining virtual agent movement based on speech data than the look-up speaking database. Similar models may be employed for listening movements as well as transition movements. Thus, there are a variety of different ways wherein a system can associate and coordinate virtual agent movement for speaking, listening and transition segments of a human-computer dialog.
A conversation control module 128 controls the interaction between the text generation module 126 (voice and content of virtual agent) and receiving the text or speech data from the source 102. These modules preferably exist and are operational on a computer server or servers. The particular kind of computer device or network on which these modules run is immaterial to the present invention. For example, they may be on an intranet, or the Internet or operational over a wireless network. Those of skill in the art will understand that other dialog modules are related to the conversation module and may include an automatic speech recognition module, a spoken language understanding module, a dialog management module, a presentation module and a text-to-speech module.
FIG. 4A illustrates an aspect of the invention in a network context. This aspect relates to a client/server configuration over a packet network, Internet Protocol network, or the Internet 142. Further, the network 142 may refer to a wireless network wherein the client device 140A transmits via a wireless protocol such as Bluetooth, GSM, CDMA, TDMA or other wireless protocol to a base station that communicates with the server 144A. U.S. Pat. No. 6,366,886 B1, incorporated herein by reference, includes details regarding packet networks and ASR over packet networks.
In FIG. 4A, the prosodic analysis for the virtual agent listening, transition, and response modes is performed over the network 142 at a server 144A. This may be the requirement where the client device is a small hand-held device with limited computing memory capabilities. As an example of the communication in this regard, the client device includes means for receiving speech from a user 139. This, as will be understood by those of skill in the art, may comprise a microphone and speech processing, voice coder and wireless technologies to enable the received speech to be transmitted over the network 142 to the server 144A. A control module 144 handles the processes required to receive the user speech and transmit data associated with the speech across the network 142 to the server 144A.
The server 144A includes a prosodic analysis module 146 that analyzes the prosodic elements of the received speech. According to this prosodic analysis, in real time, a listening behavior module 148 in the server 144A transmits data associated with controlling the head movements of the virtual agent 160. Thus, while the user 139 speaks to the virtual agent 160, it appears that the agent is “listening.” The listening behavior includes any behavior up to and through a transition from listening to preparing to speak. Therefore, data transmitted as shown in module 148 includes transition movements from listening to speaking.
Next, a response module 150 generates a response to the user's speech or question. The response, as is known in the art, may be generated according to processes performed by an automatic speech recognition module, a spoken language understanding module, a dialog manager module, a presentation manager, and/or a text-to-speech module. (FIG. 4B shows these modules in more detail.) As the response is transmitted to the client device 140A over the network 142, the server 144A performs a prosodic analysis on the response 152 such that the client device 140A receives and presents the appropriate real-time responsive behavior of the virtual agent 160 such as facial movements and expressions stored in the response behavior module 154 and associated with the text of the response.
A realistic conversation, including the visual experience of watching avirtual agent160 on theclient device140A, takes place between the user and thevirtual agent160 over thenetwork142.
The transmission over anetwork142 such as a packet network is not limited to cases where prosody is the primary basis for generating movements. For example, in some cases, thevirtual agent160 responses are preprogrammed such that the response and thevirtual agent160 motion are known in advance. In this case, then the data associated with the response as well as the head movements are both transmitted over thenetwork142 in the response phase of the conversation between the user and thevirtual agent160.
FIG. 4B illustrates another aspect of the network context of the present invention. In this case, the client device 140B includes a control module 143 that receives the speech from the user 139. The control module transmits the speech over the network 142 to the server 144B. Concurrently, the control module transmits speech data to a prosodic analysis module 145. The listening behavior is controlled in this aspect of the invention on the client device 140B. Accordingly, while the person 139 is speaking, the modules on the client device 140B control the prosodic analysis, movement selection, and transition movement. By locally processing the listening behavior of the virtual agent 160, a more real-time experience is provided to the user 139. Further, this isolates the client device and virtual agent listening behavior from network transmission traffic slow-downs.
The server 144B performs the necessary processing to carry on a dialog with the user 139, including automatic speech recognition (ASR) 149, spoken language understanding (SLU) 151, dialog management (DM) 153, text-to-speech processing (TTS) 155, and prosodic analysis and virtual agent movement control 157 for the response of the virtual agent 160.
The present aspects of the invention are not limited to the specific processing examples provided above. The combination of prosodic analysis and other ASR, SLU, DM and TTS processes necessary to carry out a spoken and visual dialog with the user 139 may be shared in any combination between the client device 140B and the server 144B.
In another aspect of the invention shown in FIG. 4B, the prosodic analysis module can control the virtual agent movement both when the virtual agent listens and when it speaks. In this variation of the invention, preferably, the prosodic analysis module 145 on the client device receives and analyzes the TTS speech data from the server 144B. The movement of the virtual agent 160 while the virtual agent is speaking or transitioning from speaking to listening or listening to speaking is thus entirely controlled by software operating on the client device 140B. The movement control module 157 on the server may or may not be operative or exist in this aspect of the invention since all movement behavior is processed locally.
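The split described for FIG. 4B can be sketched as follows. This is only an illustration under stated assumptions: the class name, the callback parameters and the toy analysis rule are hypothetical, and the two paths run one after the other here rather than concurrently.

```python
class ClientControlModule:
    """Sketch of a client-side control module that fans each speech chunk
    out to two consumers (all names here are hypothetical)."""

    def __init__(self, send_to_server, analyze_locally):
        self.send_to_server = send_to_server
        self.analyze_locally = analyze_locally

    def on_speech(self, chunk):
        # Network path: the server handles ASR, SLU, DM and TTS.
        self.send_to_server(chunk)
        # Local path: drives the listening behavior without a network round trip.
        return self.analyze_locally(chunk)


def toy_prosodic_analysis(chunk):
    """Stand-in for prosodic analysis: request a nod when the peak level is high."""
    return "nod" if max(chunk) > 0.5 else "idle"


outbox = []  # stands in for the network connection to the server
control = ClientControlModule(outbox.append, toy_prosodic_analysis)
movements = [control.on_speech(c) for c in ([0.1, 0.2], [0.7, 0.3])]
print(movements)
```

Because the listening movement is derived from the local path only, a slow or congested network delays the spoken response but not the agent's listening behavior.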
Recording real people and correlating their behavior with the prosodic information in the text enables the automation of the process of generating facial expressions and head movements. Prosodic information can be extracted reliably from text without the need of a high-level analysis of the semantic content. Measurements confirm a strong correlation between prosodic information and behavioral patterns. This is true not only for the correlation between behavioral patterns and the text spoken by a person, but also for the correlation between text spoken by one person and the behavioral patterns of the listener.
Accents within spoken text are prime candidates for placing prominent head movements. Hence, their reliable identification is of main interest here. Stress within isolated words has been compiled in lexica for many different languages. Within continuous speech, however, the accents are not necessarily placed at the location of the lexical stress. Context or the desire to highlight specific parts of a sentence may shift the place of an accent. It is therefore necessary to consider the whole sentence in order to predict where accents will appear.
Any interruption of the speech flow is another event predestined for placing head or facial movements. Many disfluencies in speech are unpredictable events, such as a speaker's hesitations; ‘ah’ or ‘uh’ is often inserted spontaneously into the flow of speech. However, other short interruptions are predictably placed at phrase boundaries. Prosodic phrases, which are meaningful units, make the speech easier for the listener to follow. That is why prosodic phrase boundaries often coincide with major syntactic boundaries and punctuation marks. FIG. 5 shows the types of boundaries predicted by the prosody tool, and FIG. 6 illustrates how often they appear in the text. With each phrase boundary, a specific type of pitch movement is associated. This is of special interest here since it allows, for example, adding a rise of the head to a rising pitch. Such synchronizations can give a virtual agent the appearance of actually ‘understanding’ the text being spoken by the virtual agent. Identifying these types of boundaries can further provide the real-time appearance of actively listening to the speaking user.
As shown in FIG. 5, column 190 illustrates the pitch accents, such as H*, and column 192 shows the corresponding number of events for the data set of 1075 sentences. Column 194 shows the phrase boundary, such as L-L%, and the corresponding column 196 shows the number of events in the data set.
FIG. 6 illustrates the phonetic transcription and prosodic annotation of the sentence “I'm your virtual secretary.” Column 200 illustrates the time until the end of the phone. Column 202 shows the corresponding phone. Column 204 shows the prosodic event, where applicable for a phone. Column 206 illustrates the associated word in the sentence with respect to time, phone and prosodic event of the other three columns.
In this case, the phone durations in column 200 were extracted from the spoken text with a phone-labeling tool. Alternatively, the prosody analysis tools can predict phone durations from the text. Accents are shown here at the height of the last phone of a syllable, but it has to be understood that the syllable as a whole is considered accented and not an individual phone.
Of the different accent types, the H* accents strongly dominate (see FIG. 6). Moreover, the prediction of the other types of accents is not very reliable. Studies show that even experienced human labelers agree on the accent types in less than 60% of cases. Therefore, the present invention does not differentiate between the various types of pitch accents and lumps them all together simply as accents.
The prosody predictor according to research associated with the present invention has been trained with ToBI hand labels for 1,477 utterances of one speaker. The accent yes/no decision is correct in 89% of all syllables and the yes/no decision for phrase boundaries in 95%. The accent types are predicted correctly in 59% of all syllables and the boundary types in 74% of all cases. All the speakers recorded are different from the speaker used to train the prosody predictor. One voice can be used as well to train for prosody prediction.
Associated with the present invention is gathering data on face recognition from human readers. Natural head and facial movements while reading text provide the information for generating the head movements of a virtual agent. Hence, when using a human speaker, the speakers must be able to move their heads freely while they pronounce text. It is preferable that no sensors on the person's head be used while gathering the human data. The natural features of each human face are used since no markers or other artifacts for aiding the recognition systems are used.
In an exemplary data gathering method, recordings are done with the speaker sitting in front of a teleprompter, looking straight into a camera. The frame size is 720×480 pixels and the head's height is typically about ⅔ of the frame height. The total of the recordings corresponds to 3 hours and 15 minutes of text. Facial features are extracted from these videos and head poses for each of over 700,000 frames. Recordings may be done at, for example, 60 frames per second.
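The figures quoted in this paragraph are consistent with one another, as a quick check shows. The only assumption is that the example rate of 60 frames per second applies to the whole 3 hours and 15 minutes.

```python
# Figures quoted in the paragraph above.
hours, minutes = 3, 15
frames_per_second = 60  # the example recording rate

total_seconds = hours * 3600 + minutes * 60
total_frames = total_seconds * frames_per_second

print(total_seconds)  # 11700
print(total_frames)   # 702000, i.e. "over 700,000 frames"
```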
In order to determine the precise head movements as well as the movements of facial parts, the positions of several facial feature points must be measured with a high accuracy. A face recognition system according to the present invention proceeds in multiple steps, each one refining the precision of the previous step. See, e.g., “Face Analysis for the Synthesis of Photo-Realistic Talking Head,” Graf, H. P., Cosatto, E. and Ezzat, T., Proc. Fourth IEEE Int. Conf. Automatic Face and Gesture Recognition, Grenoble, France, IEEE Computer Society, Los Alamitos, 2000, pp. 189–194.
Using motion, color and shape information, the head's position and the location of the main facial features are determined first with a low accuracy. Then, smaller areas are searched with a set of matched filters, in order to identify specific feature points with high precision. FIG. 7 shows an example of this process using a portion of a virtual agent face 210. Representative samples of feature points may comprise, for example, eye corners 212 and 214 with corresponding points 212a and 214a on the image 210 and eye edges 216 and 218 with corresponding reference points 216a and 218a on the image 210. These images 212, 214, 216 and 218 are cut from image 210. By averaging three of these images and band-pass filtering the result, these sample images or kernels become less sensitive to the appearance in one particular image. Such sample images or filter kernels are scanned over an area to identify the exact location of a particular feature, for example, an eye corner. A set of such kernels is generated to cover the appearances of the feature points in all different situations. For example, nine different instances of each mouth corner are recorded, covering three different widths and three different heights of the mouth.
When the system analyzes a new image, the first steps of the face recognition, namely shape and color analysis, provide information such as how wide open the mouth is. One can therefore select kernels of mouth corners corresponding to a mouth of similar proportions. Image and kernels are Fourier transformed for the convolution, which is computationally more efficient for larger kernels. In this way a whole set of filter kernels is scanned over the image, identifying the feature points.
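The kernel scan described above can be illustrated with a minimal sketch. It works directly in the spatial domain on nested lists, so it omits both the band-pass filtering and the Fourier-domain convolution that the text uses for larger kernels; subtracting the kernel mean is only a crude stand-in for the band-pass step. The function names and the tiny test image are hypothetical.

```python
def average_kernels(samples):
    """Pixel-wise mean of equally sized sample patches (lists of rows)."""
    n = len(samples)
    rows, cols = len(samples[0]), len(samples[0][0])
    return [[sum(s[r][c] for s in samples) / n for c in range(cols)]
            for r in range(rows)]


def zero_mean(patch):
    """Subtract the mean so uniform brightness does not affect the score."""
    values = [v for row in patch for v in row]
    mean = sum(values) / len(values)
    return [[v - mean for v in row] for row in patch]


def scan(image, kernel):
    """Return (row, col) of the top-left corner where the kernel scores highest."""
    k = zero_mean(kernel)
    kh, kw = len(k), len(k[0])
    best_score, best_pos = float("-inf"), None
    for r in range(len(image) - kh + 1):
        for c in range(len(image[0]) - kw + 1):
            score = sum(k[i][j] * image[r + i][c + j]
                        for i in range(kh) for j in range(kw))
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos


# Three sample patches of the same (toy) feature, averaged into one kernel.
samples = [[[10, 1], [1, 1]], [[9, 1], [1, 1]], [[8, 1], [1, 1]]]
kernel = average_kernels(samples)

# A 6x6 image that is dark except for one instance of the feature at (2, 3).
image = [[0] * 6 for _ in range(6)]
image[2][3], image[2][4], image[3][3], image[3][4] = 9, 1, 1, 1

position = scan(image, kernel)
print(position)
```

Selecting which kernel set to scan, for example mouth corners for a wide-open mouth, would happen before the call to `scan`, using the coarse shape and color analysis.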
The head pose is calculated from the location of the eye corners and the nostrils in the image. FIG. 8 shows an example of identified feature points 232 in the image and a synthetic face model 234 in the same pose. Under these conditions, the accuracy of the feature points 232 must be better than one pixel; otherwise the resulting head pose may be off by more than one degree, and the measurements become too noisy for a reliable analysis.
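The sensitivity of the pose to pixel errors can be illustrated for the simplest case, an in-plane rotation estimated from two eye corners. This is not the full pose calculation from eye corners and nostrils described above, and the 57-pixel corner separation is a hypothetical value chosen for illustration.

```python
import math


def roll_degrees(left_eye, right_eye):
    """In-plane head rotation from two eye-corner points given as (x, y) pixels."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))


# Eye corners 57 pixels apart (hypothetical), first level, then with one
# corner displaced vertically by a single pixel.
level = roll_degrees((100.0, 200.0), (157.0, 200.0))
off_by_one = roll_degrees((100.0, 200.0), (157.0, 201.0))

print(level)
print(round(off_by_one, 2))
```

With these numbers a one-pixel error already shifts the estimated angle by about one degree, which matches the order of magnitude stated in the text.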
There is a tradeoff between accuracy and selectivity of the filters. Larger filter kernels tend to be more accurate, yet they are more selective. For example, when the head rotates, the more selective filters are useable over a smaller range of orientations. Hence, more different filters have to be prepared. Preferably, the system typically tunes the filters to provide an average precision of between one and one and a half pixels. Then the positions are filtered over time to improve the accuracy to better than one pixel. Some events, for example eye blinks, can be so rapid that a filtering over time distorts the measurements too much. Such events are marked and the pose calculation is suspended for a few frames.
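The temporal filtering and the suspension around rapid events can be sketched as below. The text does not name the filter, so the centered moving average, the window length and the per-frame blink flags are assumptions made for illustration.

```python
def smooth_positions(positions, blink_flags, window=5):
    """Centered moving average over time.

    Frames flagged as blinks yield None (pose calculation suspended) and
    are left out of their neighbors' averages.
    """
    half = window // 2
    smoothed = []
    for i, flagged in enumerate(blink_flags):
        if flagged:
            smoothed.append(None)
            continue
        lo, hi = max(0, i - half), min(len(positions), i + half + 1)
        valid = [positions[j] for j in range(lo, hi) if not blink_flags[j]]
        smoothed.append(sum(valid) / len(valid))
    return smoothed


# One coordinate of a feature point over seven frames; frame 3 is a blink
# that produced a wild measurement.
positions = [10.0, 11.0, 9.0, 50.0, 10.0, 11.0, 9.0]
blink_flags = [False, False, False, True, False, False, False]

smoothed = smooth_positions(positions, blink_flags, window=3)
print(smoothed)
```

Averaging several frames reduces measurement noise, which is how a per-frame precision of one to one and a half pixels can be brought below one pixel, while the flagged frame is kept from distorting its neighbors.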
Besides the head pose, the present invention focuses on the positions of the eyebrows, the shape of the eyes and the direction of gaze. These facial parts move extensively during speech and are a major part of any visual expression of a speaker's face. They are measured with filters similar to those described above. They do not need to be measured with the same precision as the features used for measuring head poses. Whether eyebrows move up one pixel more or less does not change the face's appearance markedly.
The first part of our face analysis, where the head and facial parts are measured with a low accuracy, works well for any face. Sufficient redundancy is built into the system to handle even glasses and beards. The filters for measuring feature positions with high accuracy, on the other hand, are designed specifically for each person, using samples of that person's face.
For identifying prosodic movements, the rotation angles of the head around the x-, y-, and z-axis are determined, together with the translations. FIG. 9 shows the orientation of the coordinate system used for these measurements. For example, ax, ay, az mark the rotations around the x, y, z axes.
All the recorded head and facial movements were added spontaneously by the speakers while they were reading from the teleprompter. The speakers were not aware that the head movements would be analyzed. For most of the recordings the speakers were asked to show a ‘neutral’ emotional state.
For the analysis, each of the six signals representing rotations and translations of the head is split into two frequency bands: (1) 0–2 Hz: slow head movements and (2) 2 Hz–15 Hz: faster head movements associated with speech.
Movements in the low frequency range extend over several syllables and often over multiple words. Such movements tend to be caused by a change of posture by the speaker, rather than being related to the speech. FIGS. 10A–10C are examples of the head angle ax as a function of time. FIG. 10A shows a graph 250 of the original signal 252 and a low-pass filtered signal 254. The time on the horizontal axis is given in frame numbers with 30 frames a second.
The faster movements, on the other hand, are closely related to the prosody of the text. Accents are often underlined with nods that extend typically over two to four phones. This pattern is clearly visible in FIG. 10B. In this graph 260, the data 262 represents the high-pass filtered part of the original signal 252 in FIG. 10A. The markings in FIG. 10B represent phone boundaries with frame numbers at the top of the graph. Phones and prosodic events are shown below the graph. Here the nods are very clearly synchronized with the pitch accents (positive values for angle ax correspond to down movements of the head). Typical for visual prosody, and something observed for most speakers, is that the same motion (in this case a nod) is repeated several times. Not only are such motion patterns repeated within a sentence, but often over an extended period of time, sometimes as much as a whole recording session lasting about half an hour.
A further characteristic feature of visual prosody is the initial head movement, leading into a speech segment after a pause. FIG. 10B shows this as a slight down movement of the head (ax slightly positive) 264, followed by an upward nod at the start of the sentence 266. While developing the present invention, the applicant recorded 50 sentences of the same type of greetings and short expressions in one recording session. The speaker whose record is shown in FIGS. 10A–10B executed the same initial motion pattern in over 70% of these sentences.
In FIGS. 10A–10C, only the rotation around the x-axis, ax, is shown. In this recording the rotation ax, i.e., nods, was by far the strongest signal. Many speakers emphasize nods, but rotations around the y-axis are quite common as well, while significant rotations around the z-axis are rare. A combination of ax and ay, which leads to diagonal head movements, is also observed often.
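The band split can be sketched with a naive discrete Fourier transform. The text does not say which filters were used, so this frequency-domain approach is an assumption; the 30 frames per second rate, the test frequencies and the function name are likewise illustrative, and each band is treated as half-open (lower edge included, upper edge excluded).

```python
import cmath
import math


def band_pass(signal, fps, low_hz, high_hz):
    """Keep only DFT components with low_hz <= |f| < high_hz (naive O(n^2) DFT)."""
    n = len(signal)
    spectrum = [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
                for k in range(n)]
    for k in range(n):
        freq = min(k, n - k) * fps / n  # absolute frequency of bin k in Hz
        if not (low_hz <= freq < high_hz):
            spectrum[k] = 0
    # Inverse DFT; the result is real because bins are zeroed symmetrically.
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


# Two seconds of a synthetic head angle: a slow 0.5 Hz posture drift plus
# a faster 5 Hz component standing in for speech-related movement.
fps, n = 30, 60
slow = [math.sin(2 * math.pi * 0.5 * t / fps) for t in range(n)]
fast = [0.3 * math.sin(2 * math.pi * 5.0 * t / fps) for t in range(n)]
signal = [s + f for s, f in zip(slow, fast)]

low_band = band_pass(signal, fps, 0.0, 2.0)    # slow head movements
high_band = band_pass(signal, fps, 2.0, 15.0)  # faster, speech-related movements
```

In this synthetic case the two bands recover the two components almost exactly, because both test frequencies fall on DFT bins; real recordings would show some leakage between the bands.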
The mechanics for rotations around each of the three axes are different and, consequently, the details of the motion patterns vary somewhat. Yet, the main characteristics of all three of these rotations are similar and can be summarized with some basic patterns:
    • 1. Nod, i.e., an abrupt swing of the head with a similarly abrupt motion back. Nod with an overshoot at the return, i.e., the pattern looks like an ‘S’ lying on its side.
    • 2. Abrupt swing of the head without the back motion. Sometimes the rotation moves slowly, barely visibly, back to the original pose; sometimes it is followed by an abrupt motion back after some delay.
Summarizing these patterns, where each one can be executed around the x-, y-, or z-axis:
    • ^ nod (around one axis)
    • ˜ nod with overshoot
    • / abrupt swing in one direction
Having such motion primitives allows describing head movements with the primitives' types, amplitudes and durations. This provides a simple framework for characterizing a wide variety of head movements with just a few numbers. Table 1 shows some statistical data of the appearance of these primitives in one part of the text database. This illustrates the percentage of pitch accents accompanied by a major head movement. The text corpus associated with this table was 100 short sentences and greetings.
TABLE 1
P(^x | *)    42%
P(~x | *)    18%
P(/x | *)    20%
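One way such statistics could drive a synthesizer is to draw a primitive at each pitch accent with the probabilities of Table 1. This is a sketch under assumptions: it treats the three entries as mutually exclusive, so that the remaining 20% of accents receive no major movement around the x-axis, and the function name is hypothetical.

```python
import random

# Percentages from Table 1: share of pitch accents accompanied by each
# primitive around the x-axis.
TABLE_1_PERCENT = [("^", 42), ("~", 18), ("/", 20)]


def pick_primitive(u):
    """Map u in [0, 1) to a primitive symbol, or None for no major movement."""
    threshold = 0
    for symbol, percent in TABLE_1_PERCENT:
        threshold += percent
        if u * 100 < threshold:
            return symbol
    return None  # the remaining share of accents (assumed: no major movement)


rng = random.Random(0)  # seeded so that a run is repeatable
choices = [pick_primitive(rng.random()) for _ in range(5)]
print(choices)
```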
The amplitudes of the movements can vary substantially, as is illustrated by the graph 270 in FIG. 10C. For this recording the speaker was asked to articulate the same sentence as in FIG. 10B, but with a cheerful expression. The initial head motion is a wide down-and-up swing of the head 274, which runs over the first nod seen in FIG. 10B. The first nod falls down on the second accent 276 and the sentence ends with an up-down swing 278.
The patterns described here are not always as clearly visible as in the graphs of FIGS. 10A–10C. Some speakers show far fewer prosodic head movements than others. The type of text also influences prosodic head movements. When reading paragraphs from the Wall Street Journal, the head movements were typically less pronounced than for the greeting sentences. On the other hand, when speakers have to concentrate strongly, such as while reading a demanding text, they often exhibit very repetitive prosodic patterns.
Head and facial movements during speech exhibit a wide variety of patterns that depend on personality, mood, content of the text being spoken, and other factors. Despite large variations from person to person, patterns of head and facial movements are strongly correlated with the prosodic structure of the text. Angles and amplitudes of the head movements vary widely, yet their timing shows surprising consistency. Similarly, raised eyebrows are often placed at prosodic events—sometimes with head nods, at other times without. Visual prosody is not nearly as rigidly defined as acoustic prosody, but it is clearly identifiable in the speech of most people.
Recent progress in face recognition enables an automatic registration of head and facial movements and opens the opportunity to analyze them quantitatively without any intrusive measuring devices. Such information is a key ingredient for further progress in synthesizing natural looking talking heads. Lip-sound synchronization has reached a stage where most viewers judge it as natural. The next step of improvement lies in realistic behavioral patterns. Several sequences were synthesized where the head movements consisted of concatenations of the motion primitives described above. With good motion-prosody synchronization, the heads look much more engaging and even give the illusion that they ‘understand’ what they articulate and that they actively listen to the speaker.
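A concatenation of motion primitives, each described by a type, an amplitude and a duration, can be sketched as follows. The curve shapes (half sine, damped full sine, smooth step) are illustrative assumptions and are not taken from the recorded data; the class and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Primitive:
    kind: str         # '^' nod, '~' nod with overshoot, '/' abrupt swing
    amplitude: float  # peak angle in degrees; the sign gives the direction
    duration: int     # length in frames


def render(p):
    """Return per-frame angle offsets for one primitive (illustrative shapes)."""
    offsets = []
    for i in range(p.duration):
        x = i / (p.duration - 1) if p.duration > 1 else 1.0
        if p.kind == "^":    # out and back: half sine
            offsets.append(p.amplitude * math.sin(math.pi * x))
        elif p.kind == "~":  # out, back past the rest pose, settle: damped sine
            offsets.append(p.amplitude * math.sin(2 * math.pi * x) * (1 - 0.5 * x))
        elif p.kind == "/":  # swing and hold: smooth step
            offsets.append(p.amplitude * (1 - math.cos(math.pi * x)) / 2)
        else:
            raise ValueError(f"unknown primitive {p.kind!r}")
    return offsets


def concatenate(primitives):
    """Chain primitives; each starts from the angle where the previous one ended."""
    track, base = [], 0.0
    for p in primitives:
        segment = [base + v for v in render(p)]
        track.extend(segment)
        if segment:
            base = segment[-1]
    return track


# A nod of 4 degrees over 9 frames followed by a swing of -2 degrees over 5 frames.
track = concatenate([Primitive("^", 4.0, 9), Primitive("/", -2.0, 5)])
print(len(track))
```

To synchronize with prosody, the start frame of each primitive would be placed at the accented syllable or phrase boundary reported by the prosody tool.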
Prosody prediction tools exist for several languages and the present invention is applicable to any language for which such tools are available. Even if there are no prosody tools available for a particular language, it may be possible to generate one. Typically the generation of a prosody prediction tool is much simpler than a language understanding tool. If no good prosody prediction tools exist, many prosodic elements can still be predicted from the syntactic structure of the text. Therefore the concept of visual prosody is applicable to any language.
In another aspect of the invention, it is noted that the basic elements of prosody, such as pauses, pitch, rate or relative duration, and loudness, are culturally driven in that different cultures attach different meanings to prosodic elements. Accordingly, an aspect of this invention is to provide a system and method of adapting the prosodic speaking, transition and listening movements of a virtual agent to the culture of the speaker.
Further gradations may include differences in dialect or accents wherein the prosodic responses may be different, for example, between a person from New York City and Georgia.
In this embodiment of the invention, a database of speaking, transition, and listening movements as described above is compiled for each set of cultural possibilities. For example, an English set is generated as well as a Japanese set. As shown in FIG. 11, a language determination module will determine the language of the speaker (280). This may be via speech recognition or via a dialog with the user wherein the user indicates what language or culture is desired.
Suppose the speaker selects Japanese via the determination module. The system then knows that to enable a natural looking virtual agent in the conversation using the prosodic nature of the speech it will receive, the appropriate set of speaking, listening and transition databases must be selected. An example of such a change may be that the virtual agent would bow at the culturally appropriate times in one conversation; and where the user is from a different culture, then those agent movements would not be displayed during that conversation. A selection module (282) then selects that appropriate set of databases (284) for use in the conversation with the speaker.
The system proceeds then to modify, if necessary, the prosodic-driven movements of the virtual agent according to the selected databases (284) such that the Japanese user will experience a more natural conversation. The system also operates dynamically wherein if the user part-way through a conversation switches to English, ends the conversation, or indicates a different cultural set, then the process returns to the language determination module (280) for making that switch, and then continuing to select the appropriate set of databases for proceeding with the conversation with the user.
The number of databases is only limited by the storage space. Databases for Spanish, English with a New York Accent, English with a Southern Accent, Japanese, Chinese, Arabic, French, Senior Citizen, Teenager, Ethnicity, etc. may be stored and ready for the specific cultural movement that will appeal most to the particular user.
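The determination and selection steps of FIG. 11 can be sketched as a simple lookup. The database contents, the language keys and the fallback to a default set are hypothetical and serve only to show the control flow, including a switch part-way through a conversation.

```python
# Hypothetical movement databases, one set per cultural possibility.
MOVEMENT_DATABASES = {
    "en": {"speaking": ["nod"], "listening": ["nod"], "transition": ["head raise"]},
    "ja": {"speaking": ["nod", "bow"], "listening": ["nod"], "transition": ["bow"]},
}


class Conversation:
    def __init__(self, default="en"):
        self.default = default  # assumed fallback when no set matches
        self.active = None

    def determine_language(self, indicated):
        """Step 280: here the user simply indicates a language or culture."""
        return indicated if indicated in MOVEMENT_DATABASES else self.default

    def select_databases(self, indicated):
        """Steps 282 and 284: select the matching set of movement databases."""
        self.active = MOVEMENT_DATABASES[self.determine_language(indicated)]
        return self.active


conversation = Conversation()
japanese_set = conversation.select_databases("ja")
# Part-way through, the user switches to English: return to step 280.
english_set = conversation.select_databases("en")
print("bow" in japanese_set["speaking"], "bow" in english_set["speaking"])
```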
Although the above description may contain specific details, they should not be construed as limiting the claims in any way. Other configurations of the described embodiments of the invention are part of the scope of this invention. For example, any electronic communication between people where an animated entity can deliver the message is contemplated. Email, instant messaging and other forms of communication are being developed, such as broadband 3G and 4G wireless technologies, wherein animated entities as generally described herein may apply. Further, visual prosody contains a strong personality component and some trademark movements may be associated with certain personalities. The system can be trained to exhibit prosodic behavior of a particular person or one that is considered more ‘generic’. Accordingly, the appended claims and their legal equivalents should only define the invention, rather than any specific examples given.

Claims (42)

US10/173,341 | Priority date 2002-05-16 | Filing date 2002-06-17 | System and method of providing conversational visual prosody for talking heads | Expired - Lifetime | US7076430B1 (en)

Priority Applications (5)

Application Number | Publication Number | Priority Date | Filing Date | Title
US10/173,341 | US7076430B1 (en) | 2002-05-16 | 2002-06-17 | System and method of providing conversational visual prosody for talking heads
US11/237,561 | US7349852B2 (en) | 2002-05-16 | 2005-09-28 | System and method of providing conversational visual prosody for talking heads
US11/237,557 | US7353177B2 (en) | 2002-05-16 | 2005-09-28 | System and method of providing conversational visual prosody for talking heads
US12/019,974 | US8200493B1 (en) | 2002-05-16 | 2008-01-25 | System and method of providing conversational visual prosody for talking heads
US12/020,049 | US7844467B1 (en) | 2002-05-16 | 2008-01-25 | System and method of providing conversational visual prosody for talking heads

Applications Claiming Priority (2)

Application Number | Publication Number | Priority Date | Filing Date | Title
US38095202P | | 2002-05-16 | 2002-05-16 |
US10/173,341 | US7076430B1 (en) | 2002-05-16 | 2002-06-17 | System and method of providing conversational visual prosody for talking heads

Related Child Applications (2)

Application Number | Relation | Publication Number | Priority Date | Filing Date | Title
US11/237,561 | Continuation | US7349852B2 (en) | 2002-05-16 | 2005-09-28 | System and method of providing conversational visual prosody for talking heads
US11/237,557 | Continuation | US7353177B2 (en) | 2002-05-16 | 2005-09-28 | System and method of providing conversational visual prosody for talking heads

Publications (1)

Publication Number | Publication Date
US7076430B1 (en) | 2006-07-11

Family

ID=36126687

Family Applications (5)

Application Number | Status | Publication Number | Priority Date | Filing Date | Title
US10/173,341 | Expired - Lifetime | US7076430B1 (en) | 2002-05-16 | 2002-06-17 | System and method of providing conversational visual prosody for talking heads
US11/237,561 | Expired - Lifetime | US7349852B2 (en) | 2002-05-16 | 2005-09-28 | System and method of providing conversational visual prosody for talking heads
US11/237,557 | Expired - Lifetime | US7353177B2 (en) | 2002-05-16 | 2005-09-28 | System and method of providing conversational visual prosody for talking heads
US12/019,974 | Expired - Fee Related | US8200493B1 (en) | 2002-05-16 | 2008-01-25 | System and method of providing conversational visual prosody for talking heads
US12/020,049 | Expired - Lifetime | US7844467B1 (en) | 2002-05-16 | 2008-01-25 | System and method of providing conversational visual prosody for talking heads

Country Status (1)

Country | Link
US (5) | US7076430B1 (en)

Cited By (42)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
US20020084902A1 (en)*2000-12-292002-07-04Zadrozny Wlodek W.Translator for infants and toddlers
US20040176957A1 (en)*2003-03-032004-09-09International Business Machines CorporationMethod and system for generating natural sounding concatenative synthetic speech
US20040230410A1 (en)*2003-05-132004-11-18Harless William G.Method and system for simulated interactive conversation
US20050043956A1 (en)*2003-07-032005-02-24Sony CorporationSpeech communiction system and method, and robot apparatus
US20050069852A1 (en)*2003-09-252005-03-31International Business Machines CorporationTranslating emotion to braille, emoticons and other special symbols
US20050131744A1 (en)*2003-12-102005-06-16International Business Machines CorporationApparatus, system and method of automatically identifying participants at a videoconference who exhibit a particular expression
US20050131697A1 (en)*2003-12-102005-06-16International Business Machines CorporationSpeech improving apparatus, system and method
US20050256712A1 (en)*2003-02-192005-11-17Maki YamadaSpeech recognition device and speech recognition method
US20060217979A1 (en)*2005-03-222006-09-28Microsoft CorporationNLP tool to dynamically create movies/animated scenes
US20070036334A1 (en)*2005-04-222007-02-15Culbertson Robert FSystem and method for intelligent service agent using VoIP
US20070201639A1 (en)*2006-02-142007-08-30Samsung Electronics Co., Ltd.System and method for controlling voice detection of network terminal
US20080120548A1 (en)*2006-11-222008-05-22Mark MoritaSystem And Method For Processing User Interaction Information From Multiple Media Sources
US20080154594A1 (en)*2006-12-262008-06-26Nobuyasu ItohMethod for segmenting utterances by using partner's response
US20080215325A1 (en)*2006-12-272008-09-04Hiroshi HoriiTechnique for accurately detecting system failure
US20080313130A1 (en)*2007-06-142008-12-18Northwestern UniversityMethod and System for Retrieving, Selecting, and Presenting Compelling Stories form Online Sources
US20090182702A1 (en)*2008-01-152009-07-16Miller Tanya MActive Lab
US20090234639A1 (en)*2006-02-012009-09-17Hr3D Pty LtdHuman-Like Response Emulator
US20100036660A1 (en)*2004-12-032010-02-11Phoenix Solutions, Inc.Emotion Detection Device and Method for Use in Distributed Systems
US20100082345A1 (en)*2008-09-262010-04-01Microsoft CorporationSpeech and text driven hmm-based body animation synthesis
US20140127662A1 (en)*2006-07-122014-05-08Frederick W. KronComputerized medical training system
US20150025891A1 (en)*2007-03-202015-01-22Nuance Communications, Inc.Method and system for text-to-speech synthesis with personalized voice
US9020816B2 (en)2008-08-142015-04-2821Ct, Inc.Hidden markov model for speech processing with training method
US9301722B1 (en)*2014-02-032016-04-05Toyota Jidosha Kabushiki KaishaGuiding computational perception through a shared auditory space
US20160118050A1 (en)*2014-10-242016-04-28Sestek Ses Ve Iletisim Bilgisayar Teknolojileri Sanayi Ticaret Anonim SirketiNon-standard speech detection system and method
US20160217500A1 (en)*2015-01-232016-07-28Conversica, LlcSystems and methods for management of automated dynamic messaging
US9536049B2 (en)2012-09-072017-01-03Next It CorporationConversational virtual healthcare assistant
US9552350B2 (en)2009-09-222017-01-24Next It CorporationVirtual assistant conversations for ambiguous user input and goals
US9823811B2 (en)2013-12-312017-11-21Next It CorporationVirtual assistant team identification
US9836177B2 (en)2011-12-302017-12-05Next IT Innovation Labs, LLCProviding variable responses in a virtual-assistant environment
US10210454B2 (en)2010-10-112019-02-19Verint Americas Inc.System and method for providing distributed intelligent assistance
US10379712B2 (en)2012-04-182019-08-13Verint Americas Inc.Conversation user interface
US10445115B2 (en)2013-04-182019-10-15Verint Americas Inc.Virtual assistant focused user interfaces
US10489434B2 (en)2008-12-122019-11-26Verint Americas Inc.Leveraging concepts with information retrieval techniques and knowledge bases
US10545648B2 (en)2014-09-092020-01-28Verint Americas Inc.Evaluating conversation data based on risk factors
US10586369B1 (en)2018-01-312020-03-10Amazon Technologies, Inc.Using dialog and contextual data of a virtual reality environment to create metadata to drive avatar animation
US11196863B2 (en)2018-10-242021-12-07Verint Americas Inc.Method and system for virtual assistant conversations
US11301632B2 (en)2015-01-232022-04-12Conversica, Inc.Systems and methods for natural language processing and classification
US11341962B2 (en)2010-05-132022-05-24Poltorak Technologies LlcElectronic personal interactive device
US11551188B2 (en)2015-01-232023-01-10Conversica, Inc.Systems and methods for improved automated conversations with attendant actions
US11568175B2 (en)2018-09-072023-01-31Verint Americas Inc.Dynamic intent classification based on environment variables
US11663409B2 (en)2015-01-232023-05-30Conversica, Inc.Systems and methods for training machine learning models using active learning
US11989521B2 (en)2018-10-192024-05-21Verint Americas Inc.Natural language processing with non-ontological hierarchy models

Families Citing this family (32)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US7076430B1 (en)* | 2002-05-16 | 2006-07-11 | At&T Corp. | System and method of providing conversational visual prosody for talking heads
US7386799B1 (en) | 2002-11-21 | 2008-06-10 | Forterra Systems, Inc. | Cinematic techniques in avatar-centric communication during a multi-user online simulation
US7827034B1 (en)* | 2002-11-27 | 2010-11-02 | Totalsynch, Llc | Text-derived speech animation tool
KR100906136B1 (en)* | 2003-12-12 | 2009-07-07 | 닛본 덴끼 가부시끼가이샤 | Information processing robot
US7613613B2 (en)* | 2004-12-10 | 2009-11-03 | Microsoft Corporation | Method and system for converting text to lip-synchronized speech in real time
MX2007009044A (en) | 2005-01-28 | 2008-01-16 | Breakthrough Performance Techn | Systems and methods for computerized interactive training.
GB2427109B (en)* | 2005-05-30 | 2007-08-01 | Kyocera Corp | Audio output apparatus, document reading method, and mobile terminal
US20070055526A1 (en)* | 2005-08-25 | 2007-03-08 | International Business Machines Corporation | Method, apparatus and computer program product providing prosodic-categorical enhancement to phrase-spliced text-to-speech synthesis
US7752043B2 (en) | 2006-09-29 | 2010-07-06 | Verint Americas Inc. | Multi-pass speech analytics
US8571463B2 (en)* | 2007-01-30 | 2013-10-29 | Breakthrough Performancetech, Llc | Systems and methods for computerized interactive skill training
US20080221892A1 (en)* | 2007-03-06 | 2008-09-11 | Paco Xander Nathan | Systems and methods for an autonomous avatar driver
WO2008119078A2 (en)* | 2007-03-28 | 2008-10-02 | Breakthrough Performance Technologies, Llc | Systems and methods for computerized interactive training
WO2008141125A1 (en)* | 2007-05-10 | 2008-11-20 | The Trustees Of Columbia University In The City Of New York | Methods and systems for creating speech-enabled avatars
MX2011001060A (en) | 2008-07-28 | 2011-07-29 | Breakthrough Performancetech Llc | Systems and methods for computerized interactive skill training.
JP5408134B2 (en)* | 2008-08-13 | 2014-02-05 | 日本電気株式会社 | Speech synthesis system
WO2010018648A1 (en)* | 2008-08-13 | 2010-02-18 | 日本電気株式会社 | Voice synthesis system
US8719016B1 (en) | 2009-04-07 | 2014-05-06 | Verint Americas Inc. | Speech analytics system and system and method for determining structured speech
US20120016661A1 (en)* | 2010-07-19 | 2012-01-19 | Eyal Pinkas | System, method and device for intelligent textual conversation system
US8937620B1 (en) | 2011-04-07 | 2015-01-20 | Google Inc. | System and methods for generation and control of story animation
US8887047B2 (en) | 2011-06-24 | 2014-11-11 | Breakthrough Performancetech, Llc | Methods and systems for dynamically generating a training program
KR101358999B1 (en)* | 2011-11-21 | 2014-02-07 | (주) 퓨처로봇 | method and system for multi language speech in charactor
ITPE20130004A1 (en)* | 2013-03-05 | 2014-09-06 | Blue Cinema Tv Sas Di Daniele Balda Cci & C | PROCEDURE FOR THE IMPLEMENTATION OF AN INTERACTIVE AUDIOVISUAL INTERFACE THAT REPRODUCES HUMAN BEINGS
US9104780B2 (en) | 2013-03-15 | 2015-08-11 | Kamazooie Development Corporation | System and method for natural language processing
US9280147B2 (en)* | 2013-07-19 | 2016-03-08 | University Of Notre Dame Du Lac | System and method for robotic patient synthesis
US11404170B2 (en)* | 2016-04-18 | 2022-08-02 | Soap, Inc. | Method and system for patients data collection and analysis
US10516938B2 (en)* | 2016-07-16 | 2019-12-24 | Ron Zass | System and method for assessing speaker spatial orientation
US11145217B2 (en)* | 2017-09-21 | 2021-10-12 | Fujitsu Limited | Autonomous speech and language assessment
US20200279553A1 (en)* | 2019-02-28 | 2020-09-03 | Microsoft Technology Licensing, Llc | Linguistic style matching agent
US11817086B2 (en)* | 2020-03-13 | 2023-11-14 | Xerox Corporation | Machine learning used to detect alignment and misalignment in conversation
EP4181120A4 (en)* | 2020-11-25 | 2024-01-10 | Samsung Electronics Co., Ltd. | Electronic device for generating response to user input and operation method of same
KR102546532B1 (en)* | 2021-06-30 | 2023-06-22 | 주식회사 딥브레인에이아이 | Method for providing speech video and computing device for executing the method
US12254548B1 (en)* | 2022-12-16 | 2025-03-18 | Amazon Technologies, Inc. | Listener animation

Citations (11)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US4884972A (en) | 1986-11-26 | 1989-12-05 | Bright Star Technology, Inc. | Speech synchronized animation
US5652828A (en) | 1993-03-19 | 1997-07-29 | Nynex Science & Technology, Inc. | Automated voice synthesis employing enhanced prosodic treatment of text, spelling of text and rate of annunciation
WO1999039281A2 (en)* | 1998-01-30 | 1999-08-05 | Easynet Access Inc. | Personalized internet interaction
US5983190A (en)* | 1997-05-19 | 1999-11-09 | Microsoft Corporation | Client server animation system for managing interactive user interface characters
US5987415A (en)* | 1998-03-23 | 1999-11-16 | Microsoft Corporation | Modeling a user's emotion and personality in a computer user interface
US6081780A (en) | 1998-04-28 | 2000-06-27 | International Business Machines Corporation | TTS and prosody based authoring system
US6112177A (en) | 1997-11-07 | 2000-08-29 | At&T Corp. | Coarticulation method for audio-visual text-to-speech synthesis
US6389396B1 (en)* | 1997-03-25 | 2002-05-14 | Telia Ab | Device and method for prosody generation at visual synthesis
US6665643B1 (en)* | 1998-10-07 | 2003-12-16 | Telecom Italia Lab S.P.A. | Method of and apparatus for animation, driven by an audio signal, of a synthesized model of a human face
US6735566B1 (en)* | 1998-10-09 | 2004-05-11 | Mitsubishi Electric Research Laboratories, Inc. | Generating realistic facial animation from speech
US20050033582A1 (en)* | 2001-02-28 | 2005-02-10 | Michael Gadd | Spoken language interface

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
AU8247598A (en) | 1997-06-27 | 1999-01-19 | Pachal, Jan | Biopsy method and device
US6522333B1 (en)* | 1999-10-08 | 2003-02-18 | Electronic Arts Inc. | Remote communication through visual representations
US7076430B1 (en)* | 2002-05-16 | 2006-07-11 | At&T Corp. | System and method of providing conversational visual prosody for talking heads

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US4884972A (en) | 1986-11-26 | 1989-12-05 | Bright Star Technology, Inc. | Speech synchronized animation
US5652828A (en) | 1993-03-19 | 1997-07-29 | Nynex Science & Technology, Inc. | Automated voice synthesis employing enhanced prosodic treatment of text, spelling of text and rate of annunciation
US6389396B1 (en)* | 1997-03-25 | 2002-05-14 | Telia Ab | Device and method for prosody generation at visual synthesis
US5983190A (en)* | 1997-05-19 | 1999-11-09 | Microsoft Corporation | Client server animation system for managing interactive user interface characters
US6112177A (en) | 1997-11-07 | 2000-08-29 | At&T Corp. | Coarticulation method for audio-visual text-to-speech synthesis
WO1999039281A2 (en)* | 1998-01-30 | 1999-08-05 | Easynet Access Inc. | Personalized internet interaction
US5987415A (en)* | 1998-03-23 | 1999-11-16 | Microsoft Corporation | Modeling a user's emotion and personality in a computer user interface
US6081780A (en) | 1998-04-28 | 2000-06-27 | International Business Machines Corporation | TTS and prosody based authoring system
US6665643B1 (en)* | 1998-10-07 | 2003-12-16 | Telecom Italia Lab S.P.A. | Method of and apparatus for animation, driven by an audio signal, of a synthesized model of a human face
US6735566B1 (en)* | 1998-10-09 | 2004-05-11 | Mitsubishi Electric Research Laboratories, Inc. | Generating realistic facial animation from speech
US20050033582A1 (en)* | 2001-02-28 | 2005-02-10 | Michael Gadd | Spoken language interface

Non-Patent Citations (18)

* Cited by examiner, † Cited by third party
Title
"Computer Facial Animation" by Parke, F. I., Waters, K., A. K. Peters, Wellesley, Massachusetts, 1997.
"Embodied Conversational Agents", Cassell, J., Sullivan, J., Prevost, S., Churchill, E., (eds.) MIT Press. Cambridge, 2000.
"Face Analysis for the Synthesis of Photo-Realistic Talking Heads", by Graf, H. P., Cosatto, E., and Ezzat, T., Proc. Fourth IEEE Int. Conf. Automatic Face and Gesture Recognition, Grenoble, France, IEEE Computer Society, Los Alamitos, 2000, pp. 189-194.
"Inter-transcriber reliability and ToBI Prosodic Labeling" by Syrdal, A. K., and McGory, J., ICSLP 2000, Beijing, China; vol. 3, pp. 235-238.
"Photo-Realistic Talking Heads from Image Samples", by Cosatto, E., and Graf, H. P., IEEE Trans. Multimedia, pp. 152-163, Sep. 2000.
"Soft Machine: A Personable Interface", by Lewis, J., Purcell, P., Architecture Machine Group, Massachusetts Institute of Technology, Cambridge, MA 02139.
"Spoken Language Processing", by Huang X., Acero, A., Hon, H., Prentice Hall, 2001, pp. 739-791.
"The Timing of Shifts in Head Postures During Conversation", by Hadar, U., Steiner, T. J., Grant, E. C., Rose, F. C., Human Movement Science, 3, pp. 237-245, 1984.
"The ToBI Annotation Conventions", Beckman, M., Hirschberg, J., http://www.ling.ohio-state.edu/phonetics/ToBI/ToBl6.html.
"ToBI: A Standard for Labeling English Prosody", by Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P., Pierrehumbert, J., Hirschberg, J., Int. Conf. on Spoken Language Processing, 1992, Banff, Canada, pp. 867-870.
"Visual Prosody: Facial Movements Accompanying Speech" by Graf, Hans Peter, Cosatto, Eric, Strom, Volker and Huang, Fu Jie, AT & T Labs Research, 200 Laurel Ave., South, Middletown, NJ 07748.
Ball et al., in "Embodied Conversational Agents," MIT press 2000, pp. 189-219.*
Cappella et al., "Rules for Responsive Robots: Using Human Interactions to Build Virtual Interactions," in "Stability and Change in Relationships" editors Vangelisti et al., Cambridge University Press, 2001.*
Cassell et al., "Embodiment in conversational interfaces: Rea," CHI 99, May 1999, pp. 520-527.*
Hayes-Roth et al., "Designing for Diversity: Multi-Cultural Characters for a Multi-Cultural World," Proceedings of Imagine, Feb. 2002.*
Kakumanu et al., "Speech Driven Facial Animation," Proceedings of the 2001 Workshop on Perceptive User Interfaces, Nov. 2001, Orlando, Florida.*
Lavagetto, F. et al., "Converting Speech into Lip Movements: A Multimedia Telephone for Hard of Hearing People," IEEE Transactions on Rehabilitation Engineering, vol. 3, No. 1, Mar. 1995.*
Marquis, et al., "Emotionally responsive poker playing agents," in Notes for Twelfth National Conference on Artificial Intelligence, AAAI-94 Workshop on Artificial Intelligence, Artificial Life, and Entertainment, AAAI 1994.*

Cited By (86)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US7143044B2 (en)* | 2000-12-29 | 2006-11-28 | International Business Machines Corporation | Translator for infants and toddlers
US20020084902A1 (en)* | 2000-12-29 | 2002-07-04 | Zadrozny Wlodek W. | Translator for infants and toddlers
US7711560B2 (en)* | 2003-02-19 | 2010-05-04 | Panasonic Corporation | Speech recognition device and speech recognition method
US20050256712A1 (en)* | 2003-02-19 | 2005-11-17 | Maki Yamada | Speech recognition device and speech recognition method
US7308407B2 (en)* | 2003-03-03 | 2007-12-11 | International Business Machines Corporation | Method and system for generating natural sounding concatenative synthetic speech
US20040176957A1 (en)* | 2003-03-03 | 2004-09-09 | International Business Machines Corporation | Method and system for generating natural sounding concatenative synthetic speech
US20040230410A1 (en)* | 2003-05-13 | 2004-11-18 | Harless William G. | Method and system for simulated interactive conversation
US7797146B2 (en)* | 2003-05-13 | 2010-09-14 | Interactive Drama, Inc. | Method and system for simulated interactive conversation
US20130060566A1 (en)* | 2003-07-03 | 2013-03-07 | Kazumi Aoyama | Speech communication system and method, and robot apparatus
US20050043956A1 (en)* | 2003-07-03 | 2005-02-24 | Sony Corporation | Speech communiction system and method, and robot apparatus
US8538750B2 (en)* | 2003-07-03 | 2013-09-17 | Sony Corporation | Speech communication system and method, and robot apparatus
US8321221B2 (en)* | 2003-07-03 | 2012-11-27 | Sony Corporation | Speech communication system and method, and robot apparatus
US20120232891A1 (en)* | 2003-07-03 | 2012-09-13 | Sony Corporation | Speech communication system and method, and robot apparatus
US8209179B2 (en)* | 2003-07-03 | 2012-06-26 | Sony Corporation | Speech communication system and method, and robot apparatus
US20050069852A1 (en)* | 2003-09-25 | 2005-03-31 | International Business Machines Corporation | Translating emotion to braille, emoticons and other special symbols
US7607097B2 (en)* | 2003-09-25 | 2009-10-20 | International Business Machines Corporation | Translating emotion to braille, emoticons and other special symbols
US20050131744A1 (en)* | 2003-12-10 | 2005-06-16 | International Business Machines Corporation | Apparatus, system and method of automatically identifying participants at a videoconference who exhibit a particular expression
US20050131697A1 (en)* | 2003-12-10 | 2005-06-16 | International Business Machines Corporation | Speech improving apparatus, system and method
US8214214B2 (en)* | 2004-12-03 | 2012-07-03 | Phoenix Solutions, Inc. | Emotion detection device and method for use in distributed systems
US20100036660A1 (en)* | 2004-12-03 | 2010-02-11 | Phoenix Solutions, Inc. | Emotion Detection Device and Method for Use in Distributed Systems
US20060217979A1 (en)* | 2005-03-22 | 2006-09-28 | Microsoft Corporation | NLP tool to dynamically create movies/animated scenes
US7512537B2 (en)* | 2005-03-22 | 2009-03-31 | Microsoft Corporation | NLP tool to dynamically create movies/animated scenes
US20070036334A1 (en)* | 2005-04-22 | 2007-02-15 | Culbertson Robert F | System and method for intelligent service agent using VoIP
US7711103B2 (en)* | 2005-04-22 | 2010-05-04 | Culbertson Robert F | System and method for intelligent service agent using VoIP
US20090234639A1 (en)* | 2006-02-01 | 2009-09-17 | Hr3D Pty Ltd | Human-Like Response Emulator
US9355092B2 (en)* | 2006-02-01 | 2016-05-31 | i-COMMAND LTD | Human-like response emulator
US7890334B2 (en)* | 2006-02-14 | 2011-02-15 | Samsung Electronics Co., Ltd. | System and method for controlling voice detection of network terminal
US20070201639A1 (en)* | 2006-02-14 | 2007-08-30 | Samsung Electronics Co., Ltd. | System and method for controlling voice detection of network terminal
US20140127662A1 (en)* | 2006-07-12 | 2014-05-08 | Frederick W. Kron | Computerized medical training system
US20080120548A1 (en)* | 2006-11-22 | 2008-05-22 | Mark Morita | System And Method For Processing User Interaction Information From Multiple Media Sources
US8793132B2 (en)* | 2006-12-26 | 2014-07-29 | Nuance Communications, Inc. | Method for segmenting utterances by using partner's response
US20080154594A1 (en)* | 2006-12-26 | 2008-06-26 | Nobuyasu Itoh | Method for segmenting utterances by using partner's response
US20080215325A1 (en)* | 2006-12-27 | 2008-09-04 | Hiroshi Horii | Technique for accurately detecting system failure
US9368102B2 (en)* | 2007-03-20 | 2016-06-14 | Nuance Communications, Inc. | Method and system for text-to-speech synthesis with personalized voice
US20150025891A1 (en)* | 2007-03-20 | 2015-01-22 | Nuance Communications, Inc. | Method and system for text-to-speech synthesis with personalized voice
US20080313130A1 (en)* | 2007-06-14 | 2008-12-18 | Northwestern University | Method and System for Retrieving, Selecting, and Presenting Compelling Stories form Online Sources
US10176827B2 (en) | 2008-01-15 | 2019-01-08 | Verint Americas Inc. | Active lab
US20140365223A1 (en)* | 2008-01-15 | 2014-12-11 | Next It Corporation | Virtual Assistant Conversations
US10109297B2 (en) | 2008-01-15 | 2018-10-23 | Verint Americas Inc. | Context-based virtual assistant conversations
US10438610B2 (en)* | 2008-01-15 | 2019-10-08 | Verint Americas Inc. | Virtual assistant conversations
US9589579B2 (en) | 2008-01-15 | 2017-03-07 | Next It Corporation | Regression testing
US20090182702A1 (en)* | 2008-01-15 | 2009-07-16 | Miller Tanya M | Active Lab
US9020816B2 (en) | 2008-08-14 | 2015-04-28 | 21Ct, Inc. | Hidden markov model for speech processing with training method
US8224652B2 (en)* | 2008-09-26 | 2012-07-17 | Microsoft Corporation | Speech and text driven HMM-based body animation synthesis
US20100082345A1 (en)* | 2008-09-26 | 2010-04-01 | Microsoft Corporation | Speech and text driven hmm-based body animation synthesis
US11663253B2 (en) | 2008-12-12 | 2023-05-30 | Verint Americas Inc. | Leveraging concepts with information retrieval techniques and knowledge bases
US10489434B2 (en) | 2008-12-12 | 2019-11-26 | Verint Americas Inc. | Leveraging concepts with information retrieval techniques and knowledge bases
US11250072B2 (en) | 2009-09-22 | 2022-02-15 | Verint Americas Inc. | Apparatus, system, and method for natural language processing
US9552350B2 (en) | 2009-09-22 | 2017-01-24 | Next It Corporation | Virtual assistant conversations for ambiguous user input and goals
US9563618B2 (en) | 2009-09-22 | 2017-02-07 | Next It Corporation | Wearable-based virtual agents
US10795944B2 (en) | 2009-09-22 | 2020-10-06 | Verint Americas Inc. | Deriving user intent from a prior communication
US11727066B2 (en) | 2009-09-22 | 2023-08-15 | Verint Americas Inc. | Apparatus, system, and method for natural language processing
US11367435B2 (en) | 2010-05-13 | 2022-06-21 | Poltorak Technologies Llc | Electronic personal interactive device
US11341962B2 (en) | 2010-05-13 | 2022-05-24 | Poltorak Technologies Llc | Electronic personal interactive device
US11403533B2 (en) | 2010-10-11 | 2022-08-02 | Verint Americas Inc. | System and method for providing distributed intelligent assistance
US10210454B2 (en) | 2010-10-11 | 2019-02-19 | Verint Americas Inc. | System and method for providing distributed intelligent assistance
US11960694B2 (en) | 2011-12-30 | 2024-04-16 | Verint Americas Inc. | Method of using a virtual assistant
US9836177B2 (en) | 2011-12-30 | 2017-12-05 | Next IT Innovation Labs, LLC | Providing variable responses in a virtual-assistant environment
US10983654B2 (en) | 2011-12-30 | 2021-04-20 | Verint Americas Inc. | Providing variable responses in a virtual-assistant environment
US10379712B2 (en) | 2012-04-18 | 2019-08-13 | Verint Americas Inc. | Conversation user interface
US9824188B2 (en) | 2012-09-07 | 2017-11-21 | Next It Corporation | Conversational virtual healthcare assistant
US9536049B2 (en) | 2012-09-07 | 2017-01-03 | Next It Corporation | Conversational virtual healthcare assistant
US11829684B2 (en) | 2012-09-07 | 2023-11-28 | Verint Americas Inc. | Conversational virtual healthcare assistant
US11029918B2 (en) | 2012-09-07 | 2021-06-08 | Verint Americas Inc. | Conversational virtual healthcare assistant
US11099867B2 (en) | 2013-04-18 | 2021-08-24 | Verint Americas Inc. | Virtual assistant focused user interfaces
US12182595B2 (en) | 2013-04-18 | 2024-12-31 | Verint Americas Inc. | Virtual assistant focused user interfaces
US10445115B2 (en) | 2013-04-18 | 2019-10-15 | Verint Americas Inc. | Virtual assistant focused user interfaces
US10928976B2 (en) | 2013-12-31 | 2021-02-23 | Verint Americas Inc. | Virtual assistant acquisitions and training
US10088972B2 (en) | 2013-12-31 | 2018-10-02 | Verint Americas Inc. | Virtual assistant conversations
US9823811B2 (en) | 2013-12-31 | 2017-11-21 | Next It Corporation | Virtual assistant team identification
US9830044B2 (en) | 2013-12-31 | 2017-11-28 | Next It Corporation | Virtual assistant team customization
US9301722B1 (en)* | 2014-02-03 | 2016-04-05 | Toyota Jidosha Kabushiki Kaisha | Guiding computational perception through a shared auditory space
US10545648B2 (en) | 2014-09-09 | 2020-01-28 | Verint Americas Inc. | Evaluating conversation data based on risk factors
US9659564B2 (en)* | 2014-10-24 | 2017-05-23 | Sestek Ses Ve Iletisim Bilgisayar Teknolojileri Sanayi Ticaret Anonim Sirketi | Speaker verification based on acoustic behavioral characteristics of the speaker
US20160118050A1 (en)* | 2014-10-24 | 2016-04-28 | Sestek Ses Ve Iletisim Bilgisayar Teknolojileri Sanayi Ticaret Anonim Sirketi | Non-standard speech detection system and method
US11663409B2 (en) | 2015-01-23 | 2023-05-30 | Conversica, Inc. | Systems and methods for training machine learning models using active learning
US11551188B2 (en) | 2015-01-23 | 2023-01-10 | Conversica, Inc. | Systems and methods for improved automated conversations with attendant actions
US20160217500A1 (en)* | 2015-01-23 | 2016-07-28 | Conversica, Llc | Systems and methods for management of automated dynamic messaging
US11301632B2 (en) | 2015-01-23 | 2022-04-12 | Conversica, Inc. | Systems and methods for natural language processing and classification
US10803479B2 (en)* | 2015-01-23 | 2020-10-13 | Conversica, Inc. | Systems and methods for management of automated dynamic messaging
US10586369B1 (en) | 2018-01-31 | 2020-03-10 | Amazon Technologies, Inc. | Using dialog and contextual data of a virtual reality environment to create metadata to drive avatar animation
US11568175B2 (en) | 2018-09-07 | 2023-01-31 | Verint Americas Inc. | Dynamic intent classification based on environment variables
US11847423B2 (en) | 2018-09-07 | 2023-12-19 | Verint Americas Inc. | Dynamic intent classification based on environment variables
US11989521B2 (en) | 2018-10-19 | 2024-05-21 | Verint Americas Inc. | Natural language processing with non-ontological hierarchy models
US11825023B2 (en) | 2018-10-24 | 2023-11-21 | Verint Americas Inc. | Method and system for virtual assistant conversations
US11196863B2 (en) | 2018-10-24 | 2021-12-07 | Verint Americas Inc. | Method and system for virtual assistant conversations

Also Published As

Publication number | Publication date
US7844467B1 (en) | 2010-11-30
US7353177B2 (en) | 2008-04-01
US7349852B2 (en) | 2008-03-25
US20060074689A1 (en) | 2006-04-06
US8200493B1 (en) | 2012-06-12
US20060074688A1 (en) | 2006-04-06

Similar Documents

Publication | Publication Date | Title
US7076430B1 (en) | | System and method of providing conversational visual prosody for talking heads
US7136818B1 (en) | | System and method of providing conversational visual prosody for talking heads
Graf et al. | | Visual prosody: Facial movements accompanying speech
CN110688911B (en) | | Video processing method, device, system, terminal equipment and storage medium
CN106653052B (en) | | Virtual human face animation generation method and device
Marsella et al. | | Virtual character performance from speech
US5884267A (en) | | Automated speech alignment for image synthesis
US20120130717A1 (en) | | Real-time Animation for an Expressive Avatar
Albrecht et al. | | Automatic generation of non-verbal facial expressions from speech
Benoit et al. | | Audio-visual and multimodal speech systems
Lundeberg et al. | | Developing a 3D-agent for the august dialogue system.
Gibbon et al. | | Audio-visual and multimodal speech-based systems
Schröder et al. | | Towards responsive sensitive artificial listeners
Albrecht et al. | | "May I talk to you? :-)" - facial animation from text
Nordstrand et al. | | Measurements of articulatory variation in expressive speech for a set of Swedish vowels
Kolivand et al. | | Realistic lip syncing for virtual character using common viseme set
Verma et al. | | Animating expressive faces across languages
Granström et al. | | Modelling and evaluating verbal and non-verbal communication in talking animated interface agents
Zoric et al. | | Towards facial gestures generation by speech signal analysis using huge architecture
Granström | | Multi-modal speech synthesis with applications
Beskow et al. | | Analysis and synthesis of multimodal verbal and non-verbal interaction for animated interface agents
Granström et al. | | Speech and gestures for talking faces in conversational dialogue systems
Fanelli et al. | | Acquisition of a 3d audio-visual corpus of affective speech
Wang et al. | | A real-time text to audio-visual speech synthesis system.
Pueblo | | Videorealistic facial animation for speech-based interfaces

Legal Events

Date | Code | Title | Description
AS | Assignment

Owner name: AT&T CORP., NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: COSATTO, ERIC; GRAF, HANS PETER; ISAACSON, THOMAS M.; AND OTHERS; REEL/FRAME: 013038/0483; SIGNING DATES FROM 20020610 TO 20020614

STCF | Information on status: patent grant

Free format text: PATENTED CASE

FEPP | Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

FEPP | Fee payment procedure

Free format text: PAYER NUMBER DE-ASSIGNED (ORIGINAL EVENT CODE: RMPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

FPAY | Fee payment

Year of fee payment: 4

FPAY | Fee payment

Year of fee payment: 8

AS | Assignment

Owner name: AT&T INTELLECTUAL PROPERTY II, L.P., GEORGIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: AT&T PROPERTIES, LLC; REEL/FRAME: 038275/0130

Effective date: 20160204

Owner name: AT&T PROPERTIES, LLC, NEVADA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: AT&T CORP.; REEL/FRAME: 038275/0041

Effective date: 20160204

AS | Assignment

Owner name: NUANCE COMMUNICATIONS, INC., MASSACHUSETTS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: AT&T INTELLECTUAL PROPERTY II, L.P.; REEL/FRAME: 041512/0608

Effective date: 20161214

MAFP | Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 12TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1553)

Year of fee payment: 12

AS | Assignment

Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: NUANCE COMMUNICATIONS, INC.; REEL/FRAME: 065530/0871

Effective date: 20230920

