Lip synchronization is the process of generating natural lip movements from a speech signal. In this work we address the lip-sync problem using an automatic phone recognizer that generates a phone lattice carrying posterior probabilities. The acoustic feature vector contains the posterior probabilities of all the phones over a time window centered at the current time point. Hence, this representation characterizes the phone recognition output, including the confusion patterns caused by its limited accuracy. A 3D face model with varying texture is computed by analyzing a video recording of the speaker using a 3D morphable model. Training a neural network on 30,000 data vectors from an audiovisual recording in Dutch resulted in a very good simulation of the face on independent data sets of the same or a different speaker.
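The abstract describes a pipeline in which windowed phone posteriors are mapped by a neural network to PCA eigen-projections of a 3D face model. The following is a minimal sketch of that idea, not the authors' implementation: the phone-set size, window length, number of retained PCA coefficients, network architecture, and the helper `windowed_posteriors` are all hypothetical choices made here for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical dimensions (not specified in the abstract): phone-set size,
# context window length, and number of retained PCA eigen-projections.
N_PHONES = 40   # phones in the recognizer's phone set
WINDOW = 11     # frames in the window centered on the current frame
N_EIGEN = 10    # PCA coefficients describing the 3D face model

def windowed_posteriors(posteriors, window=WINDOW):
    """Stack per-frame phone posteriors over a window centered at each frame.

    posteriors: (T, N_PHONES) array of phone posterior probabilities
    derived from the phone lattice. Returns a (T, window * N_PHONES)
    feature matrix; edges are zero-padded so every frame gets a full window.
    """
    T, n = posteriors.shape
    half = window // 2
    padded = np.vstack([np.zeros((half, n)), posteriors, np.zeros((half, n))])
    return np.stack([padded[t:t + window].ravel() for t in range(T)])

# Toy stand-ins for the audiovisual training data: recognizer posteriors
# and the PCA eigen-projections of the tracked 3D face (random here).
T = 30_000  # same order as the paper's training set size
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(N_PHONES), size=T)   # each row sums to 1
face_pca = rng.standard_normal((T, N_EIGEN))      # target coefficients

X = windowed_posteriors(post)
net = MLPRegressor(hidden_layer_sizes=(100,), max_iter=20)
net.fit(X, face_pca)

# At synthesis time, predicted coefficients would be projected back
# through the PCA basis of the 3D morphable model to animate the face.
pred = net.predict(X[:5])
print(pred.shape)  # (5, N_EIGEN)
```

Stacking posteriors over a centered window gives the network temporal context around each frame, which is one plausible reading of the abstract's "time window centered at the current time point"; the actual window length and network used in the paper may differ.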
@inproceedings{moubayed08_interspeech,
  title     = {Lip synchronization: from phone lattice to {PCA} eigen-projections using neural networks},
  author    = {Samer Al Moubayed and Michael De Smet and Hugo {Van hamme}},
  year      = {2008},
  booktitle = {Interspeech 2008},
  pages     = {2016--2019},
  doi       = {10.21437/Interspeech.2008-524},
  issn      = {2958-1796},
}
Cite as: Al Moubayed, S., De Smet, M., Van hamme, H. (2008) Lip synchronization: from phone lattice to PCA eigen-projections using neural networks. Proc. Interspeech 2008, 2016–2019, doi: 10.21437/Interspeech.2008-524