Abstract
This paper presents a novel audio-visual diviseme (viseme pair) instance selection and concatenation method for speech-driven photo-realistic mouth animation. First, an audio-visual diviseme database is built, consisting of the audio feature sequences, intensity sequences, and visual feature sequences of the instances. In the Viterbi-based diviseme instance selection, we set the accumulative cost as the weighted sum of three terms: 1) the logarithm of the concatenation smoothness of the synthesized mouth trajectory; 2) the logarithm of the pronunciation distance; and 3) the logarithm of the audio intensity distance between the candidate diviseme instance and the target diviseme segment in the incoming speech. The selected diviseme instances are time-warped and blended to construct the mouth animation. Objective and subjective evaluations of the synthesized mouth animations show that the multimodal diviseme instance selection algorithm proposed in this paper outperforms the triphone unit selection algorithm in Video Rewrite: clear, accurate, and smooth mouth animations are obtained that match the pronunciation and intensity changes in the incoming speech well. Moreover, with the logarithm functions in the accumulative cost, it is easy to set the weights to obtain optimal mouth animations.
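To make the selection step concrete, the sketch below illustrates one possible Viterbi-style search over candidate diviseme instances, accumulating the weighted sum of the three log-distance terms described above. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-instance feature layout (audio, intensity, and boundary visual features), the Euclidean distance used for all three terms, and the weight names w_smooth, w_pron, and w_inten are placeholders introduced for the example.

```python
import numpy as np

def _dist(a, b):
    # Euclidean distance with a small floor so log() stays finite
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b))) + 1e-6

def select_diviseme_instances(targets, candidates,
                              w_smooth=1.0, w_pron=1.0, w_inten=1.0):
    """Viterbi search over candidate diviseme instances (illustrative only).

    targets[t]    : dict with 'audio' and 'intensity' feature vectors of the
                    t-th diviseme segment in the incoming speech.
    candidates[t] : list of dicts with 'audio', 'intensity', 'visual_start',
                    'visual_end' features of database instances for segment t.
    Returns the index of the selected instance for every target segment.
    """
    T = len(targets)
    cost = [np.full(len(candidates[t]), np.inf) for t in range(T)]
    back = [np.zeros(len(candidates[t]), dtype=int) for t in range(T)]

    def local_cost(t, cand):
        # pronunciation distance + audio intensity distance, in the log domain
        return (w_pron * np.log(_dist(targets[t]['audio'], cand['audio']))
                + w_inten * np.log(_dist(targets[t]['intensity'], cand['intensity'])))

    for j, cand in enumerate(candidates[0]):
        cost[0][j] = local_cost(0, cand)

    for t in range(1, T):
        for j, cand in enumerate(candidates[t]):
            # concatenation term: mismatch of the mouth trajectory at the joint
            trans = np.array([
                cost[t - 1][k]
                + w_smooth * np.log(_dist(prev['visual_end'], cand['visual_start']))
                for k, prev in enumerate(candidates[t - 1])
            ])
            back[t][j] = int(np.argmin(trans))
            cost[t][j] = trans[back[t][j]] + local_cost(t, cand)

    # backtrack the minimum-cost path
    path = [int(np.argmin(cost[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

In a full system, the target segments would come from phone-level alignment of the incoming speech, and the instances returned by the search would then be time-warped and blended to form the final mouth animation, as described above.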
References
McGurk H, MacDonald J (1976) Hearing lips and seeing voices. Nature 264:746–748
Massaro D (1998) Perceiving talking faces. MIT Press, Cambridge
Theobald BJ, Bangham JA, Matthews IA, Cawley GC (2004) Near-videorealistic synthetic talking faces: implementation and evaluation. Speech Commun 44:127–140
Wu Z, Zhang S, Cai L, Meng H (2006) Real-time synthesis of Chinese visual speech and facial expressions using MPEG-4 FAP features in a three-dimensional avatar. In: Proc of the international conference on spoken language processing (ICSLP), Pittsburgh, USA, Sep 17–21
Bregler C, Covell M, Slaney M (1997) Video rewrite: driving visual speech with audio. In: Computer graphics annual conference series (SIGGRAPH), pp 353–360, Los Angeles, California
Cosatto E, Graf H (1998) Sample-based synthesis of photorealistic talking heads. In: Proc. of computer animation, pp 103–110, Philadelphia, Pennsylvania
Ezzat T, Poggio T (2000) Visual speech synthesis by morphing visemes. Int J Comput Vis 38:45–57
Ezzat T, Geiger G, Poggio T (2002) Trainable videorealistic speech animation. In: Proc of the international conference on computer graphics and interactive techniques (SIGGRAPH), pp 388–398, San Antonio, Texas
Huang F, Cosatto E, Graf H (2002) Triphone based unit selection for concatenative visual speech synthesis. In: Proc of the IEEE international conference on acoustics, speech, and signal processing (ICASSP), vol II, pp 2037–2040, Orlando, Florida, USA
Fagel S (2004) Video-realistic synthetic speech with a parametric visual speech synthesizer. In: Proc of the 8th international conference on spoken language processing (INTERSPEECH), pp 2033–2036
Yamamoto E, Nakamura S, Shikano K (1998) Lip movement synthesis from speech based on hidden Markov models. Speech Commun 26(1):105–115
Nakamura S, Yamamoto E, Shikano K (1998) Speech-to-lip movement synthesis by maximizing audio-visual joint probability based on the EM algorithm. In: Proc of the IEEE second workshop on multimedia signal processing (MMSP), pp 53–58
Choi K, Luo Y, Hwang J (2001) Hidden Markov model inversion for audio-to-visual conversion in an MPEG-4 facial animation system. J VLSI Signal Process 29:51–61
Aleksic PS, Katsaggelos AK (2003) Speech-to-video synthesis using facial animation parameters. In: Proc of the 2003 international conference on image processing (ICIP03), vol 2, issue III, pp 1–4
Cosker D, Marshall D, Rosin P, Hicks Y (2004) Speech driven facial animation using a hidden Markov coarticulation model. In: Proc of the 17th international conference on pattern recognition 2004 (ICPR2004), vol 1, pp 128–131
Xie L, Liu Z-Q (2007) Realistic mouth-synching for speech-driven talking face using articulatory modelling. IEEE Trans Multimedia 9(3):500–510
Jiang D, Xie L, Ravyse I, Zhao R, Sahli H, Cornelis J (2002) Triseme decision trees in the continuous speech recognition system for a talking head. In: Proc of the 1st IEEE international conference on machine learning and cybernetics, pp 2097–2100
Verma A, Rajput N, Subramaniam L (2003) Using viseme based acoustic models for speech driven lip synthesis. In: Proc of the IEEE international conference on acoustics, speech, and signal processing (ICASSP), pp 720–723
Deng Z, Neumann U, Lewis JP et al. (2006) Expressive facial animation synthesis by learning speech coarticulation and expression spaces. IEEE Trans Vis Comput Graph 12(6):1–12
Cao Y, Faloutsos P, Kohler E, Pighin F (2004) Real-time speech motion synthesis from recorded motions. In: Proc of the ACM SIGGRAPH/Eurographics symposium on computer animation, pp 347–355
Ma J, Cole R, Pellom B, Ward W, Wise B (2006) Accurate visible speech synthesis based on concatenating variable length motion capture data. IEEE Trans Vis Comput Graph 12:1–11
Ravyse I, Enescu V, Sahli H (2005) Kernel-based head tracker for videophony. In: Proc of the IEEE international conference on image processing 2005 (ICIP2005), Genoa, Italy, vol 3, pp 1068–1071
Hou Y, Sahli H, Ravyse I, Zhang Y, Zhao R (2007) Robust shape-based head tracking. In: Proc of the advanced concepts for intelligent vision systems. LNCS, vol 4678, pp 340–351
Ma J, Cole R, Pellom B, Ward W, Wise B (2004) Accurate automatic visible speech synthesis of arbitrary 3D models based on concatenation of diviseme motion capture data. Comput Animat Virtual Worlds 15:485–500
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1. Accessed 5 December 2008
Young SJ (1993) The HTK hidden Markov model toolkit: design and philosophy. Technical Report, University of Cambridge, Department of Engineering, Cambridge, UK
Jiang D, Ravyse I, Sahli H, Zhang Y (2008) Accurate visual speech synthesis based on diviseme unit selection and concatenation. In: Proc of the IEEE 10th workshop on multimedia signal processing (MMSP2008), pp 906–909
Schaback R (1995) Computer aided geometric design III. Vanderbilt University Press, Nashville, pp 477–496
Ravyse I (2006) Facial analysis and synthesis. Ph.D. Thesis, Dept. Electronics and Informatics, Vrije Universiteit Brussel, Belgium
Cosatto E, Potamianos G, Graf HP (2000) Audio visual unit selection for the synthesis of photo-realistic talking heads. In: Proc of the IEEE international conference on multimedia and expo (ICME), vol 2, pp 619–622
Lv G, Jiang D, Zhao R, Hou Y (2007) Multi-stream asynchrony modeling for audio-visual speech recognition. In: Proc of the IEEE international symposium on multimedia (ISM), pp 37–44
Author information
Authors and Affiliations
School of Computer Science, Northwestern Polytechnical University, 127 Youyi Xilu, Xi’an, 710072, P.R. China
Dongmei Jiang
Department ETRO, Vrije Universiteit Brussel, Pleinlaan 2, 1050, Brussels, Belgium
Ilse Ravyse, Hichem Sahli & Werner Verhelst
Corresponding author
Correspondence to Dongmei Jiang.
About this article
Cite this article
Jiang, D., Ravyse, I., Sahli, H. et al. Speech driven realistic mouth animation based on multi-modal unit selection. J Multimodal User Interfaces 2, 157 (2008). https://doi.org/10.1007/s12193-009-0015-7