ACCV Workshop: Multi-view Lip-reading/Audio-visual Challenges

November 20, 2016 -- Taipei, Taiwan

Workshop Program

13:30 Opening presentation

13:45 Invited talk "Audio and Visual Modality Combination in Speech Processing Applications" by Prof. Gerasimos Potamianos

14:30 Out of time: automated lip sync in the wild, Joon Son Chung, Andrew Zisserman

14:50 Visual Speech Recognition Using PCA Networks and LSTMs in a Tandem GMM-HMM System, Marina Zimmermann, Mostafa Mehdipour Ghazi, Hazım Kemal Ekenel, Jean-Philippe Thiran

15:10 Break

15:30 Concatenated Frame Image based CNN for Visual Speech Recognition, Takeshi Saitoh, Ziheng Zhou, Guoying Zhao and Matti Pietikainen

15:50 Multi-View Automatic Lip-Reading using Neural Network, Daehyun Lee, Jongmin Lee and Kee-Eung Kim

16:10 Lip Reading from Multi View Facial Images Using 3D-AAM, Takuya Watanabe, Kouichi Katsurada and Yasushi Kanazawa

16:30 Closing

Submission Information

Submission page: https://easychair.org/conferences/?conf=mlac2016

Submission deadline: September 10, 2016

Paper format: workshop papers should follow the ACCV 2016 paper submission guidelines (see http://www.accv2016.org/paper-submission/), with a page limit of 14 pages (excluding references) in the LNCS format.

Notification of acceptance: September 17, 2016

Camera ready: September 25, 2016

Introduction

Human speech perception is known to be a bimodal process that makes use of both acoustic and visual information. There is clear evidence that visual cues play an important role in automatic speech recognition, either when the audio is seriously corrupted by noise, through audiovisual speech recognition (AVSR), or even when it is inaccessible altogether, through automatic lip-reading (ALR).

This workshop aims to challenge researchers to deal with the large variations in speakers' appearance caused by camera-view changes in the context of ALR/AVSR. To this end, we have collected a multi-view audiovisual database, named 'OuluVS2' [1], which includes 52 speakers uttering both discrete and continuous utterances, recorded simultaneously from 5 different camera views. To facilitate participation, we have preprocessed most of the data to extract the regions of interest, that is, a rectangular area enclosing the talking mouth. The cropped mouth videos are available to researchers together with the original ones.

Please visit the Home page for instructions on how to download the database.
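As an illustration only, the following minimal Python sketch shows one way to iterate over the cropped mouth videos with OpenCV. The directory layout and file naming are hypothetical placeholders rather than the actual structure of the OuluVS2 package, so adjust the paths to match the downloaded data.

import os
import cv2  # OpenCV: pip install opencv-python

# Hypothetical layout: <root>/<speaker>/<view>/<utterance>.mp4
# (placeholder names -- check the downloaded package for the real structure)
DATA_ROOT = "OuluVS2/cropped_mouth"

def load_clip(path):
    """Read one cropped mouth video into a list of grayscale frames."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    cap.release()
    return frames

def iter_clips(root=DATA_ROOT):
    """Yield (speaker, view, utterance name, frames) for every clip under root."""
    for speaker in sorted(os.listdir(root)):
        for view in sorted(os.listdir(os.path.join(root, speaker))):
            view_dir = os.path.join(root, speaker, view)
            for fname in sorted(os.listdir(view_dir)):
                if fname.endswith(".mp4"):
                    yield speaker, view, fname, load_clip(os.path.join(view_dir, fname))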

Researchers are invited to work on either type of data and to tackle problems including, but not limited to, the following (see the sketch after the list):

  • Single-view ALR/AVSR - to train and test on data recorded from a single camera view.
  • Multiple-view ALR/AVSR - to train and test on synchronized data recorded from multiple camera views.
  • Cross-view ALR/AVSR - to learn and transfer knowledge from a source view (e.g., the frontal view) to enhance learning for a target view (e.g., the profile view).
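To make the three settings concrete, here is a minimal sketch of how training and test views might be selected for each protocol. The view indices (0 = frontal, ..., 4 = profile) and the sample representation are assumptions for illustration only.

# Hypothetical view indices: 0 = frontal, ..., 4 = profile.
FRONTAL, PROFILE = 0, 4

def split_by_view(samples, train_views, test_views):
    """samples: iterable of (view, features, label) tuples (assumed format)."""
    train = [s for s in samples if s[0] in train_views]
    test = [s for s in samples if s[0] in test_views]
    return train, test

# Single-view: train and test on the same camera view.
#   train, test = split_by_view(samples, {FRONTAL}, {FRONTAL})
# Multiple-view: train and test on synchronized data from all views.
#   train, test = split_by_view(samples, {0, 1, 2, 3, 4}, {0, 1, 2, 3, 4})
# Cross-view: learn on a source view, evaluate on a different target view.
#   train, test = split_by_view(samples, {FRONTAL}, {PROFILE})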

Test Protocols

We encourage participants to use the OuluVS2 database due to its suitability, readiness, and richness. To make the results comparable, we randomly divide the 52 subjects into two groups: one of 40 subjects for training and validation, and one of 12 subjects for testing.

Click here to download the txt file containing the subject IDs for training and validation.

Click here to download the txt file containing the subject IDs for testing.

We expect all the participants who use the OuluVS2 database to follow the same protocol so that we can summarize and present their results at the workshop.
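A minimal sketch of applying the subject-level split in Python, assuming the two txt files list one subject ID per line (the file names below are placeholders for the files downloaded above):

def read_ids(path):
    """Read one subject ID per line from the provided txt file."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

# Placeholder file names -- use the txt files downloaded from the links above.
trainval_ids = read_ids("trainval_subjects.txt")  # expected: 40 subjects
test_ids = read_ids("test_subjects.txt")          # expected: 12 subjects

assert trainval_ids.isdisjoint(test_ids), "the two subject groups must not overlap"

def assign_split(subject_id):
    """Map a subject ID to its protocol split."""
    if subject_id in trainval_ids:
        return "train/val"
    if subject_id in test_ids:
        return "test"
    raise ValueError("unknown subject id: " + subject_id)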

Additional Information

Venue: the workshop will be held at the Taipei International Convention Center (TICC), the same venue as the main conference.

Publication: the workshop papers will be published by Springer in the Lecture Notes in Computer Science (LNCS) series.

Registration: workshop registration will be handled as part of the main conference registration. Details will be announced later.

People

Organizers

Program Committee

Reference

[1] I. Anina, Z. Zhou, G. Zhao and M. Pietikainen (2015) OuluVS2: a multi-view audiovisual database for non-rigid mouth motion analysis. In Proceedings of IEEE International Conference on Automatic Face and Gesture Recognition (FG), pages 1-5, Ljubljana, Slovenia.