Audio-visual speech recognition (AVSR) is a technique that uses the image processing capabilities of lip reading to aid speech recognition systems in recognizing non-deterministic phones or in choosing among decisions of nearly equal probability.
The lip reading and speech recognition systems work separately, and their results are combined at the feature fusion stage. As the name suggests, AVSR has two parts: an audio part and a visual part. In the audio part, features such as the log-mel spectrogram or mel-frequency cepstral coefficients (MFCCs) are extracted from the raw audio samples, and a model is built to produce a feature vector from them. In the visual part, some variant of a convolutional neural network is generally used to compress the image into a feature vector. The two vectors (audio and visual) are then concatenated and used to predict the target.
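The concatenation step described above can be illustrated with a minimal Python sketch. It is not taken from any particular AVSR system: the vector sizes, the random placeholder feature vectors, the untrained linear classifier, and the helper names (`linear`, `softmax`, `random_layer`, `fuse_and_classify`) are all assumptions made for illustration. In a real system the two input vectors would come from the trained audio model and the visual convolutional network.

```python
import math
import random


def linear(weights, bias, x):
    """Apply an affine map: one output score per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(rng, n_out, n_in):
    """Hypothetical stand-in for trained classifier parameters."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse_and_classify(audio_vec, visual_vec, classifier):
    """Concatenate the two feature vectors and score each target class."""
    fused = audio_vec + visual_vec  # list concatenation is the fusion step
    weights, bias = classifier
    return fused, softmax(linear(weights, bias, fused))


rng = random.Random(0)
AUDIO_DIM, VISUAL_DIM, NUM_CLASSES = 8, 6, 4  # illustrative sizes only

# Placeholders for the outputs of the audio model and the visual CNN.
audio_vec = [rng.gauss(0.0, 1.0) for _ in range(AUDIO_DIM)]
visual_vec = [rng.gauss(0.0, 1.0) for _ in range(VISUAL_DIM)]
classifier = random_layer(rng, NUM_CLASSES, AUDIO_DIM + VISUAL_DIM)

fused, probs = fuse_and_classify(audio_vec, visual_vec, classifier)
predicted = max(range(NUM_CLASSES), key=probs.__getitem__)
print(len(fused), predicted)
```

The fused vector has as many entries as the audio and visual vectors combined, so the classifier sees both modalities at once. Because the parameters here are random rather than trained, the predicted class carries no meaning; only the data flow is of interest.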