MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset of videos and captions. It consists of 10,000 video clips from 20 categories, and each clip is annotated with 20 English sentences. We use 9,000 clips for training and 1,000 for testing. For more details, please refer to the website: MSRVTT
For ease of use, we provide pre-extracted video features.
First, run the following command in the applications/T2VLAD/data directory to download the dataset:
bash download_features.sh
After downloading, the files in the data directory are organized as follows:
├── data
│   ├── MSR-VTT
│   │   ├── raw-captions.pkl
│   │   ├── train_list_jsfusion.txt
│   │   ├── val_list_jsfusion.txt
│   │   ├── aggregated_text_feats
│   │   │   ├── w2v_MSRVTT_openAIGPT.pickle
│   │   ├── mmt_feats
│   │   │   ├── features.audio.pkl
│   │   │   ├── features.face_agg.pkl
│   │   │   ├── features.flos_agg.pkl
│   │   │   ├── features.ocr.pkl
│   │   │   ├── features.rgb_agg.pkl
│   │   │   ├── features.s3d.pkl
│   │   │   ├── features.scene.pkl
│   │   │   ├── features.speech.pkl
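A quick way to confirm that the download completed is to compare the directory against the listing above. The following is a minimal sketch, not part of the repository: `EXPECTED_FILES`, `find_missing`, and `count_entries` are hypothetical names, the root path `data/MSR-VTT` is taken from the listing, and the split lists are assumed to hold one entry per line.

```python
from pathlib import Path

# Relative paths copied from the directory listing above.
EXPECTED_FILES = [
    "raw-captions.pkl",
    "train_list_jsfusion.txt",
    "val_list_jsfusion.txt",
    "aggregated_text_feats/w2v_MSRVTT_openAIGPT.pickle",
    "mmt_feats/features.audio.pkl",
    "mmt_feats/features.face_agg.pkl",
    "mmt_feats/features.flos_agg.pkl",
    "mmt_feats/features.ocr.pkl",
    "mmt_feats/features.rgb_agg.pkl",
    "mmt_feats/features.s3d.pkl",
    "mmt_feats/features.scene.pkl",
    "mmt_feats/features.speech.pkl",
]


def find_missing(root):
    """Return the expected files that are absent under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


def count_entries(list_file):
    """Count non-empty lines (assumption: one clip ID per line)."""
    with open(list_file, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


if __name__ == "__main__":
    root = Path("data/MSR-VTT")
    for rel in find_missing(root):
        print(f"missing: {rel}")
    for name in ("train_list_jsfusion.txt", "val_list_jsfusion.txt"):
        path = root / name
        if path.is_file():
            print(f"{name}: {count_entries(path)} entries")
```

If the lists follow the split described earlier, the two counts should come out to roughly 9,000 and 1,000, although this sketch only reports them and does not enforce it.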
Download the test features and the accompanying test CSV file:
wget https://videotag.bj.bcebos.com/Data/ActBERT/msrvtt_test.lmdb.tar
wget https://videotag.bj.bcebos.com/Data/ActBERT/MSRVTT_JSFUSION_test.csv
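Decompress msrvtt_test.lmdb.tar (without -z, tar detects any compression automatically, so this works whether or not the archive is gzipped):
tar -xvf msrvtt_test.lmdb.tar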
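For scripted setups, the same extraction step can be done from Python with the standard `tarfile` module. This is a minimal sketch under assumptions: `extract_archive` is a hypothetical helper, and the archive is assumed to come from a trusted source, since `extractall` writes whatever paths the archive contains.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a tar archive and return the member names it contained."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:*" lets tarfile detect plain, gzip, bz2, or xz archives.
    with tarfile.open(archive_path, "r:*") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names


if __name__ == "__main__":
    archive = Path("msrvtt_test.lmdb.tar")
    if archive.is_file():
        for name in extract_archive(archive):
            print(name)
```

Opening with `"r:*"` avoids having to know in advance whether the `.tar` file is compressed.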
The files in the data directory are organized as follows:
├── data
│   ├── MSR-VTT
│   │   ├── MSRVTT_JSFUSION_test.csv
│   │   ├── msrvtt_test.lmdb
│   │   │   ├── data.mdb
│   │   │   ├── lock.mdb
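Before running evaluation, it can help to peek at the CSV and confirm that the LMDB directory is complete. The sketch below uses only the standard library. `preview_csv` and `lmdb_dir_complete` are hypothetical helpers, the CSV is assumed to have a header row, and no column names are assumed.

```python
import csv
from itertools import islice
from pathlib import Path


def preview_csv(csv_path, limit=3):
    """Return the header fields and the first `limit` rows as dicts."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(islice(reader, limit))
        return reader.fieldnames, rows


def lmdb_dir_complete(lmdb_dir):
    """Check that the directory holds both data.mdb and lock.mdb."""
    lmdb_dir = Path(lmdb_dir)
    return all((lmdb_dir / name).is_file() for name in ("data.mdb", "lock.mdb"))


if __name__ == "__main__":
    root = Path("data/MSR-VTT")
    csv_path = root / "MSRVTT_JSFUSION_test.csv"
    if csv_path.is_file():
        fields, rows = preview_csv(csv_path)
        print("columns:", fields)
        for row in rows:
            print(row)
    print("lmdb complete:", lmdb_dir_complete(root / "msrvtt_test.lmdb"))
```

Reading the actual records from `msrvtt_test.lmdb` would require a third-party LMDB binding, which this sketch deliberately leaves out.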
- Valentin Gabeur, Chen Sun, Karteek Alahari, and Cordelia Schmid. Multi-modal transformer for video retrieval. In ECCV, 2020.