
An End-to-End Baseline for Video Captioning

Generating textual descriptions of video content is a recent challenge that has received increasing attention from both the computer vision and natural language processing communities. Its applications include human-robot interaction, automated video content description, and assisting the visually impaired by describing the content of movies to them.
In collaboration with the Visual Artificial Intelligence Laboratory (Oxford Brookes University), we are glad to present our first research project, which applies end-to-end training to a video captioning model for the first time. The project is ongoing, and we aim to release the final version (including code) by the end of 2019.


The dominant approach in video captioning is currently based on sequence learning using an encoder-decoder framework. In these models, the encoder represents an input video sequence as a fixed-dimension feature vector, which is then fed to the decoder to generate the output sentence one word at a time. One of the most severe drawbacks of such models, however, is that the underlying feature space for the video content is static and does not change during training. More specifically, an encoder (typically a Convolutional Neural Network) is pre-trained on other datasets built for different tasks, and is then used as a feature extractor on the video captioning datasets. The resulting disjoint training process, in which the decoder is trained on the captioning task with static features as input, is inherently suboptimal. We propose to address this problem by training the encoder and decoder end-to-end, as a whole. Our philosophy is inspired by the success of end-to-end trainable network architectures in image recognition, image segmentation, object detection and image captioning, but it has never before been adopted in a video captioning setting.

In this work we use Inception-ResNet-v2 as the encoder and a variant of Soft-Attention LSTM (SA-LSTM) as the decoder. We wrote our own version of the decoder in PyTorch. Our SA-LSTM framework includes a number of changes to the original formulation that significantly improve performance. Unfortunately, Inception-ResNet-v2 is very expensive in terms of memory, so large batch sizes are difficult to use. A single batch, for instance, would use 5 GB of GPU memory. To overcome this problem, our training strategy accumulates gradients until the network has processed 512 examples. The accumulated gradients are then used to update the parameters of both encoder and decoder.
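
The accumulation idea can be sketched without any deep learning library. The toy below is an illustration under stated assumptions, not the project's code: it fits a one-parameter linear model with squared error, and the names `grad_on_batch` and `accumulated_step`, the micro-batch size of 8, the learning rate, and the synthetic data are all hypothetical. Only the target of 512 examples per update comes from the text.

```python
import random

def grad_on_batch(w, batch):
    """Sum (not mean) of d/dw (w*x - y)^2 over one micro-batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch)

def accumulated_step(w, examples, micro_batch_size, lr):
    """One parameter update from gradients accumulated over micro-batches."""
    grad_sum = 0.0
    for start in range(0, len(examples), micro_batch_size):
        micro_batch = examples[start:start + micro_batch_size]
        grad_sum += grad_on_batch(w, micro_batch)  # accumulate, no update yet
    mean_grad = grad_sum / len(examples)  # normalise by the effective batch size
    return w - lr * mean_grad  # single update after all examples are seen

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(512)]
data = [(x, 3.0 * x) for x in xs]  # hypothetical data with true slope 3.0

# Same 512 examples, processed 8 at a time or all at once.
w_accumulated = accumulated_step(0.0, data, micro_batch_size=8, lr=0.1)
w_full_batch = accumulated_step(0.0, data, micro_batch_size=512, lr=0.1)
```

Both calls produce the same update up to floating-point rounding. What changes is peak memory, since only one micro-batch has to be held at a time. In PyTorch the same effect is usually obtained by calling `backward()` on each micro-batch, which adds to the stored gradients, and calling the optimiser's `step()` only once the target number of examples has been processed.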

Implemented naively, however, gradient accumulation with the full network would be quite slow compared with disjoint training, whose GPU memory requirements are much lower. To strike a balance between a closer-to-optimal but slower end-to-end training setup and a faster but less optimal disjoint training framework, we adopt a two-stage training process.
In the first stage, we freeze the weights of the pre-trained encoder and train only the decoder. As the encoder's weights are kept constant, this is equivalent to training a decoder on pre-computed encoder features. As a result, memory requirements are low and the process is fast. Once the decoder reaches a reasonable performance on the validation set, the second stage of the training process starts. In the second stage, the whole network is trained end-to-end, while the batch normalisation layers of Inception-ResNet-v2 are kept frozen. In both stages, SA-LSTM uses the real target outputs (i.e., the target words) as its next input, rather than its own previous prediction.
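
The two stages and the use of target words as decoder input can be outlined as follows. This is a minimal sketch with hypothetical names throughout (`trainable_parts`, `next_stage`, `teacher_forcing_pairs`, the parameter group labels, and the `<bos>`/`<eos>` tokens). The numeric `threshold` is also an assumption, since the text only says the switch happens at "reasonable" validation performance.

```python
BOS, EOS = "<bos>", "<eos>"  # hypothetical sentence-boundary tokens

def trainable_parts(stage):
    """Parameter groups that receive gradient updates in each stage."""
    if stage == 1:
        # Encoder frozen: its features could be pre-computed once.
        return {"decoder"}
    if stage == 2:
        # Whole network trained, except the encoder's batch normalisation.
        return {"decoder", "encoder_other"}
    raise ValueError(f"unknown stage: {stage}")

def next_stage(stage, val_score, threshold):
    """Switch to stage 2 once the validation score reaches the threshold."""
    if stage == 1 and val_score >= threshold:
        return 2
    return stage

def teacher_forcing_pairs(target_words):
    """Pair each decoder input with the word it should predict.

    The input at step t is the ground-truth word from step t-1,
    regardless of what the decoder predicted at that step.
    """
    inputs = [BOS] + list(target_words)
    outputs = list(target_words) + [EOS]
    return list(zip(inputs, outputs))

caption = ["a", "man", "is", "singing"]
pairs = teacher_forcing_pairs(caption)
stage = next_stage(1, val_score=0.8, threshold=0.7)  # illustrative numbers only
```

Here `pairs` starts with `("<bos>", "a")` and ends with `("singing", "<eos>")`, so every input the decoder sees during training is taken from the reference caption.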


We evaluate our model and compare it with competing approaches on two standard video captioning benchmarks: MSVD and MSR-VTT. The former is one of the first video captioning datasets to include multi-category videos. The latter, on the other hand, is based on 20 categories (e.g., music, people, gaming, sports, TV shows) and is of much larger scale than MSVD. To guarantee a fair quantitative comparison with the state of the art, we use the most common and well-known metrics: BLEU (4-gram version), METEOR, ROUGE-L and CIDEr.
Quantitatively, our approach matches or outperforms all previous work, except when measured with the BLEU metric. BLEU, however, is a metric with many weaknesses, e.g., the lack of explicit word-matching between translation and reference. By contrast, CIDEr was specifically designed to evaluate automatic caption generation from visual sources. Hence, it makes more sense to emphasise the CIDEr results.
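
To make the BLEU discussion concrete, the 4-gram score can be sketched as below. This is a simplified, sentence-level, single-reference version written for illustration. Benchmark evaluations normally aggregate over a whole corpus with several references per video, and the function names and example captions here are hypothetical.

```python
import math
from collections import Counter

def ngrams(words, n):
    """All contiguous n-grams of a word list, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / total

def bleu4(candidate, reference):
    """Geometric mean of 1- to 4-gram precisions times the brevity penalty."""
    precisions = [clipped_precision(candidate, reference, n)
                  for n in range(1, 5)]
    if min(precisions) == 0.0:
        return 0.0  # any missing n-gram order zeroes the unsmoothed score
    log_mean = sum(math.log(p) for p in precisions) / 4.0
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_mean)

reference = "a man is playing a guitar".split()
near_match = "a man is playing a violin".split()
paraphrase = "a guy plays the guitar".split()

score_near = bleu4(near_match, reference)
score_paraphrase = bleu4(paraphrase, reference)
```

In this sketch the paraphrase shares no bigram with the reference and therefore scores 0, even though it describes the same scene, while the caption that names the wrong instrument still scores well above 0.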
From a qualitative point of view, we report both positive and negative examples. In general, the increase in accuracy achieved by the two-stage training setting leads, in some cases, to a visible improvement in the generated sentences.
