We conducted preliminary experiments on the pre-processed short-phrase data. The figure on the right shows the recognition results. Three methods were implemented for comparison:

  • DCT+PCA+HMM - 2D-DCT features compressed by PCA (50 dimensions) and classified by HMMs
  • DCT+HiLDA+HMM - 2D-DCT features compressed by HiLDA [1] and classified by HMMs
  • RAW+PLVM - raw pixel values classified by latent variable models [2]
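The DCT+PCA front-end above can be sketched as follows. This is a minimal illustration, not the actual experimental code: the random frames, the 8×8 low-frequency coefficient block, and the reduced projection dimension (5 instead of the 50 used in the experiments, since 50 components would require at least 50 samples) are all assumptions for the sake of a runnable example.

```python
import numpy as np
from scipy.fftpack import dct

# Hypothetical mouth-ROI frames: 10 frames of 20x25 pixels (the frontal-view size)
rng = np.random.default_rng(0)
frames = rng.random((10, 20, 25))

def dct2(img):
    # Separable 2D-DCT (type II, orthonormal), applied along rows then columns
    return dct(dct(img, axis=0, norm="ortho"), axis=1, norm="ortho")

# Keep a low-frequency block of coefficients per frame, then flatten
coeffs = np.array([dct2(f)[:8, :8].ravel() for f in frames])  # shape (10, 64)

# PCA via SVD on mean-centered features; project to 5 dimensions here
# (the experiments used 50, which needs at least 50 training samples)
X = coeffs - coeffs.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
feats = X @ Vt[:5].T  # shape (10, 5): per-frame feature vectors fed to the HMM
```

The resulting per-frame feature sequence is what an HMM classifier would be trained on, one model per phrase class.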

In the experiments, images from the frontal, 30 degree, 45 degree, 60 degree and profile views were resized to 20×25, 20×25, 25×25, 25×25 and 25×20 pixels, respectively. Leave-one-speaker-out cross-validation was used to generate the recognition results.
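The leave-one-speaker-out protocol can be sketched as below. The speaker IDs, utterance counts, and labels are invented for illustration; the point is that each fold holds out all utterances of one speaker, so no model is ever tested on a speaker it was trained on.

```python
import numpy as np

# Hypothetical setup: 6 utterances from 3 speakers (IDs are illustrative)
speakers = np.array(["s1", "s1", "s2", "s2", "s3", "s3"])
labels   = np.array([0, 1, 0, 1, 0, 1])

folds = []
for held_out in np.unique(speakers):
    test_mask = speakers == held_out
    train_idx = np.flatnonzero(~test_mask)  # all other speakers' utterances
    test_idx  = np.flatnonzero(test_mask)   # the held-out speaker's utterances
    folds.append((held_out, train_idx, test_idx))
    # A real run would train the HMM/PLVM on train_idx and score test_idx here

# Every utterance is held out exactly once across the folds
assert sorted(i for _, _, te in folds for i in te) == list(range(len(labels)))
```

Averaging the per-fold accuracies then gives a speaker-independent estimate of recognition performance.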


[1] G. Potamianos, C. Neti, and G. Gravier, “Recent Advances in the Automatic Recognition of Audio-Visual Speech,” Proc. IEEE, vol. 91, no. 9, pp. 1306-1326, Sept. 2003.

[2] Z. Zhou, X. Hong, G. Zhao, and M. Pietikainen, “A Compact Representation of Visual Speech Data Using Latent Variables,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 1, pp. 181-187, Jan. 2014.