
The Democracy Now! data set

The data

Below is the complete dataset that was used for the research described in our NIPS 2007 paper. Since the files are rather large, please be considerate and save bandwidth by downloading only the parts that are really of interest to you.

This dataset is provided free of charge and without any warranty, expressed or implied, and subject to the following conditions:

  • If redistributed, the files must come with this notice, which must not be modified in any way.
  • Any publications using this dataset should cite this paper:
    G. Englebienne, T. F. Cootes and M. Rattray. A probabilistic model for generating realistic lip movements from speech. In Advances in Neural Information Processing Systems 21, 2008
Sequence num. | Sequences | Original video | Lossless audio
1 - 37 | demnow-2006-03-20.tbz2 (150M) | demnow-2006-03-20.mp4 (423M) | demnow-2006-03-20-1.flac (150M)
46 - 62 | demnow-2006-03-27.tbz2 (43M) | demnow-2006-03-27.mp4 (419M) | demnow-2006-03-27-1.flac (135M)
101 - 142 | demnow-2006-03-24.tbz2 (97M) | demnow-2006-03-24.mp4 (417M) | demnow-2006-03-24-1.flac (132M)
150 - 188 | demnow-2006-03-29.tbz2 (90M) | demnow-2006-03-29.mp4 (414M) | demnow-2006-03-29-1.flac (149M)
200 - 236 | demnow-2006-03-30.tbz2 (68M) | demnow-2006-03-30.mp4 (416M) | demnow-2006-03-30-1.flac (137M)
300 - 365 | demnow-2006-04-03.tbz2 (132M) | demnow-2006-04-03.mp4 (419M) | demnow-2006-04-03-1.flac (131M)
400 - 464 | demnow-2006-04-04.tbz2 (125M) | demnow-2006-04-04.mp4 (416M) | demnow-2006-04-04-1.flac (121M)
500 - 590 | demnow-2006-05-01.tbz2 (192M) | demnow-2006-05-01.mp4 (418M) | demnow-2006-05-01-1.flac (131M)
600 - 643 | demnow-2006-03-22.tbz2 (106M) | demnow-2006-03-22.mp4 (417M) | demnow-2006-03-22-1.flac (148M)
650 - 689 | demnow-2006-05-02.tbz2 (100M) | demnow-2006-05-02.mp4 (416M) | demnow-2006-05-02-1.flac (135M)
700 - 756 | demnow-2006-03-17.tbz2 (110M) | demnow-2006-03-17.mp4 (416M) | demnow-2006-03-17-1.flac (140M)
800 - 863 | demnow-2006-03-21.tbz2 (140M) | demnow-2006-03-21.mp4 (419M) | demnow-2006-03-21-1.flac (146M)
900 - 965 | demnow-2006-03-23.tbz2 (120M) | demnow-2006-03-23.mp4 (419M) | demnow-2006-03-23-1.flac (135M)
1000 - 1069 | demnow-2006-03-28.tbz2 (136M) | demnow-2006-03-28.mp4 (415M) | demnow-2006-03-28-1.flac (136M)
1100 - 1145 | demnow-2006-03-31.tbz2 (103M) | demnow-2006-03-31.mp4 (417M) | demnow-2006-03-31-1.flac (132M)
1200 - 1252 | demnow-2006-05-03.tbz2 (123M) | demnow-2006-05-03.mp4 (417M) | demnow-2006-05-03-1.flac (136M)

About the data

Organisation

This is a data set of 803 video sequences of a talking head. The sequences were manually cut out of freely available MP4-encoded broadcasts of Democracy Now!, an American news show, and are grouped into directories according to the particular instance of the show they were cut out of. Each directory was compressed separately in tar.bz2 format. Each individual sequence is stored in its own subdirectory, which contains the following files:

project.js
This is an ECMAScript file generated by avidemux2, the program used to cut out the sequence. It contains the date of the show and the indices of the sequence's start and end frames within that show, and can be used to modify the sequence based on the original data.
audio.wav
The sound of the sequence, decompressed and extracted from the video file.
hifi.wav
The same sound as audio.wav, but extracted from the corresponding radio show. Since the radio show is available in uncompressed CD quality, this data is of higher quality than audio.wav, even though the sample rate is lower.
hifi.mfc
The MFCCs of hifi.wav, computed at 100 Hz with the HTK toolkit and stored in HTK's .mfc file format (see the reading sketch after this file list).
transcript.txt
The textual transcription of the sequence.
align.lab
The phonetic equivalent of transcript.txt according to CMUDict v0.6, aligned to the MFCC frames of the sound by computing the Viterbi path through an unrolled HMM. When the dictionary lists multiple pronunciations for a word, this procedure selects the most likely one.
align@29-voting.lab
Same as align.lab, but downsampled to match the samples in mouth.mfc.
mouth.mfc
The parameters of the Active Appearance Model (AAM) fitted to the video frames. The model has 32 parameters, and there is one sample per video frame. Again, the results were stored in HTK's .mfc file format.
mouth-delta.mfc
Same as mouth.mfc, but augmented with delta features.
img
Contains one JPEG-compressed image of each frame of the sequence. Each frame was greyscaled and cropped to the face of the person talking.
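
The .mfc files are standard HTK binary parameter files and the .lab files are standard HTK label files, so they can be read with a few lines of code. The sketch below (Python with NumPy, not part of the dataset) assumes uncompressed HTK parameter files and label lines of the form "start end label" with times in units of 100 ns; the paths in the usage comments are hypothetical.

    # Minimal sketch for reading the HTK-format files in a sequence directory.
    # Assumes standard, uncompressed HTK binary parameter files (hifi.mfc,
    # mouth.mfc, mouth-delta.mfc) and plain "start end label" label files.
    import struct
    import numpy as np

    def read_htk_parameters(path):
        """Return (frames, sample_period_in_seconds) from an HTK .mfc file."""
        with open(path, "rb") as f:
            # 12-byte big-endian header: number of samples, sample period
            # (in 100 ns units), bytes per sample vector, parameter kind.
            n_samples, period, samp_size, parm_kind = struct.unpack(">iihh", f.read(12))
            n_coeffs = samp_size // 4          # 4-byte big-endian floats
            frames = np.frombuffer(f.read(n_samples * samp_size), dtype=">f4")
        return frames.reshape(n_samples, n_coeffs), period * 1e-7

    def read_htk_labels(path):
        """Return a list of (start_s, end_s, label) tuples from a .lab file."""
        segments = []
        with open(path) as f:
            for line in f:
                start, end, label = line.split()[:3]
                segments.append((int(start) * 1e-7, int(end) * 1e-7, label))
        return segments

    # Hypothetical usage:
    #   mfcc, period = read_htk_parameters("hifi.mfc")    # period should be 0.01 s
    #   phones = read_htk_labels("align.lab")
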
The sequences are numbered in the order in which they were extracted from the original MP4 files (which does not correspond to the order in which the files were downloaded). Consecutive sequence numbers from the same show are generally in chronological order, but there may be exceptions (this can be checked by looking at the project.js file).

Preprocessing

Extraction of hifi.wav

The audio in the MP4 stream is AAC-compressed. This gives audio that sounds quite good to human ears; however, early tests with such data (done on RealMedia- rather than MP4-encoded video) showed that HMMs used for speech recognition performed markedly worse on the compressed sound than on CD-quality sound. Fortunately, Democracy Now! exists both as a television show and as a radio show, and the radio show is available online in CD quality. Both shows contain the same audio track of the presenter talking, but they seem to be organised slightly differently. We therefore took the audio sequence extracted from the video and searched for the same sequence in the radio show.

The corresponding sequence was found by minimising the sum-squared error between the envelopes of the two sound waves, and the resulting sequences were checked manually. There may be a misalignment of at most 5 ms between the two sequences (due to the resolution at which the envelope was computed), but then again, the alignment of sound and video within the MP4 is not exact either.
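
For illustration only (this is not the code that was used), the sketch below shows this kind of envelope matching: both signals are reduced to coarse amplitude envelopes and the offset minimising the sum-squared error between them is returned. The 5 ms window matches the alignment resolution mentioned above; the function names and the normalisation step are assumptions of the sketch.

    # Illustrative sketch of envelope matching: find where a short clip occurs
    # in a longer recording by minimising the sum-squared error between their
    # amplitude envelopes.  Both signals are assumed to share one sample rate.
    import numpy as np

    def envelope(signal, sample_rate, window_s=0.005):
        """Crude amplitude envelope: mean absolute value per window (5 ms by default)."""
        win = int(round(window_s * sample_rate))
        n = len(signal) // win
        env = np.abs(signal[:n * win]).reshape(n, win).mean(axis=1)
        return env / env.mean()    # normalise, as the two recordings may differ in level

    def find_offset(clip, recording, sample_rate, window_s=0.005):
        """Offset (in seconds) of `clip` within `recording` minimising envelope SSE."""
        env_clip = envelope(clip, sample_rate, window_s)
        env_rec = envelope(recording, sample_rate, window_s)
        n = len(env_clip)
        sse = [np.sum((env_rec[i:i + n] - env_clip) ** 2)
               for i in range(len(env_rec) - n + 1)]
        # The result is only accurate to within one envelope window (5 ms here).
        return int(np.argmin(sse)) * window_s
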

Processing of the frames

Each individual frame was cropped to the region of interest (the face of the person talking), converted to greyscale, and stored as a separate JPEG file. An Active Appearance Model was then fitted to each individual frame, and the parameters of that model were stored in a separate file, mouth.mfc.
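
As a rough sketch of that preprocessing (not the original code), the snippet below crops each frame image to a face region, converts it to greyscale and saves it as a JPEG using Pillow. The input file pattern and the crop box are hypothetical placeholders, and the AAM fitting itself is not shown.

    # Rough sketch of the per-frame preprocessing: crop to the face region,
    # convert to greyscale, save as JPEG.  The glob pattern and crop box are
    # hypothetical placeholders; AAM fitting is not shown here.
    from pathlib import Path
    from PIL import Image

    def preprocess_frames(frame_dir, out_dir, crop_box=(200, 60, 520, 420)):
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        for frame_path in sorted(Path(frame_dir).glob("*.png")):
            face = Image.open(frame_path).crop(crop_box).convert("L")
            face.save(out / (frame_path.stem + ".jpg"), quality=90)
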


Gwenn Englebienne, August 2007.