This project tries to implement RecNet proposed in Reconstruction Network for Video Captioning [1], CVPR 2018.
- Ubuntu 16.04
- CUDA 9.0
- cuDNN 7.3.1
- Nvidia Geforce GTX Titan Xp 12GB
- Java 8
- Python 2.7.12
- PyTorch 1.0
- Other python libraries specified in requirements.txt
$ virtualenv .env
$ source .env/bin/activate
(.env) $ pip install --upgrade pip
(.env) $ pip install -r requirements.txt
-
Extract Inception-v4 [2] features from datasets, and locate them at
<PROJECT ROOT>/<DATASET>/features/<DATASET>_InceptionV4.hdf5
. I extracted the Inception-v4 features from here.Dataset Inception-v4 MSVD link MSR-VTT link -
Split the dataset along with the official splits by running following:
(.env) $ python -m splits.MSVD (.env) $ python -m splits.MSR-VTT
Clone evaluation codes from the official coco-evaluation repo.
(.env) $ git clone https://github.com/tylin/coco-caption.git
(.env) $ mv coco-caption/pycocoevalcap .
(.env) $ rm -rf coco-caption
-
Stage 1 (Encoder-Decoder)
(.env) $ python train.py -c configs.train_stage1
-
Stage 2 (Encoder-Decoder-Reconstructor
Set the
pretrained_decoder_fpath
ofTrainConfig
inconfigs/train_stage2.py
as the checkpoint path saved at stage 1, then run(.env) $ python train.py -c configs.stage2
You can change some hyperparameters by modifying configs/train_stage1.py
and configs/train_stage2.py
.
- Set the checkpoint path by changing
ckpt_fpath
ofRunConfig
inconfigs/run.py
. - Run
(.env) $ python run.py
* NOTE: As you can see, the performance of RecNet does not outperform SA-LSTM. Better hyperparameters should be found out.
-
MSVD
Model BLEU4 CIDEr METEOR ROUGE_L pretrained SA-LSTM 45.3 76.2 31.9 64.2 - RecNet (global) 51.1 79.7 34.0 69.4 - RecNet (local) 52.3 80.3 34.1 69.8 - (Ours) SA-LSTM 50.9 79.6 33.4 69.6 link (Ours) RecNet (global) 49.9 78.7 33.2 69.7 link (Ours) RecNet (local) 49.8 79.4 33.2 69.6 link -
MSR-VTT
Model BLEU4 CIDEr METEOR ROUGE_L pretrained SA-LSTM 36.3 39.9 25.5 58.3 - RecNet (global) 38.3 41.7 26.2 59.1 - RecNet (local) 39.1 42.7 26.6 59.3 - (Ours) SA-LSTM 38.0 40.2 25.6 58.1 link (Ours) RecNet (global) 37.4 40.0 25.5 58.0 link (Ours) RecNet (local) 37.9 40.9 25.7 58.3 link
[1] Wang, Bairui, et al. "Reconstruction Network for Video Captioning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[2] Szegedy, Christian, et al. "Inception-v4, inception-resnet and the impact of residual connections on learning." AAAI. Vol. 4. 2017.