Database Details

Recording Setting

During recording, six cameras were placed around the subject, as illustrated in the figure on the right. Utterances were shown on a computer monitor located slightly to the left, behind the frontal cameras. Subjects were asked to keep their heads still and their facial expressions neutral. The six cameras comprised five GoPro Hero3 Black Edition cameras (resolution 1920x1080, 30 fps, audio bit rate 128 kbps) and one PixeLINK PL-B774U camera (resolution 640x480, 100 fps, no audio recording).
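
The camera setup described above can be written down as a small configuration table. The following is a minimal sketch only: the names `Camera`, `CAMERAS` and `frames_for` are hypothetical and not part of the database, and the helper assumes a constant frame rate. The figures themselves are taken from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Camera:
    """Hypothetical record for one camera type used in the recording."""
    model: str
    width: int
    height: int
    fps: int
    audio_kbps: Optional[int]  # None means the camera recorded no audio


# Values as stated in the recording description.
GOPRO = Camera("GoPro Hero3 Black Edition", 1920, 1080, 30, 128)
PIXELINK = Camera("PixeLINK PL-B774U", 640, 480, 100, None)

# Five GoPro cameras plus one high-speed camera, six in total.
CAMERAS = [GOPRO] * 5 + [PIXELINK]


def frames_for(camera: Camera, seconds: float) -> int:
    """Number of frames captured in `seconds`, assuming a constant frame rate."""
    return round(camera.fps * seconds)


if __name__ == "__main__":
    for cam in (GOPRO, PIXELINK):
        print(cam.model, frames_for(cam, 2.0))
```

For a two-second utterance, this gives 60 frames for each GoPro view and 200 frames for the high-speed camera, which illustrates the difference in temporal resolution between the two camera types.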

The recording was made in an ordinary office environment, with three extra lights placed behind the camera to illuminate the subject’s face. Background noise was not controlled, so the recorded audio may include, for instance, conversations, the sound of chairs being moved, or even a baby crying.


Phase 1: digit sequence
"1 7 3 5 1 6 2 6 6 7"
"4 0 2 9 1 8 5 9 0 4"
Phase 2: phrases
"Thank you"
"Have a good time"
Phase 3: TIMIT sentences
"Chocolate and roses never fail as a romantic gift"

Each data collection session had three phases. In phase 1, the subject was asked to utter ten fixed digit sequences continuously. Each sequence consisted of ten randomly generated digits and was repeated three times during recording.

In phase 2, the subject was asked to speak ten short, everyday English phrases. The same set of phrases was used in our previously collected OuluVS database. Every phrase was uttered three times.

In phase 3 the subject was asked to read five randomly selected TIMIT sentences. Every sentence was read only once. A separate set of sentences was generated for every subject. The table on the left shows examples of the utterances used in our data collection.
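
The number of utterances per subject follows from the three phase descriptions. The sketch below works through that arithmetic. It assumes that every repetition counts as a separate utterance and that nothing was dropped or re-recorded. The names `PHASES` and `utterances_per_phase` are hypothetical.

```python
# Items and repetitions per phase, as stated in the phase descriptions.
PHASES = {
    "digits": {"items": 10, "repetitions": 3},   # phase 1: digit sequences
    "phrases": {"items": 10, "repetitions": 3},  # phase 2: short phrases
    "timit": {"items": 5, "repetitions": 1},     # phase 3: TIMIT sentences
}


def utterances_per_phase(phases):
    """Return the utterance count for each phase (items x repetitions)."""
    return {name: p["items"] * p["repetitions"] for name, p in phases.items()}


counts = utterances_per_phase(PHASES)
total = sum(counts.values())

print(counts)  # {'digits': 30, 'phrases': 30, 'timit': 5}
print(total)   # 65
```

Under these assumptions, each subject contributes 65 utterances per session. Because the phase 3 sentences differ from subject to subject, only the 60 utterances from phases 1 and 2 share the same text across subjects.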

Sample Videos