We provide all checkpoints of our EVAs for video recognition.
| model name | #param. | pre-training epochs on merged-30M | weight |
|---|---|---|---|
| `eva_psz14` | 1.0B | 150 | 🤗 HF link (2GB) |
EVA is an open billion-scale vision foundation model, pre-trained on the merged-30M dataset.
| dataset | model name | init. weight | acc@1 | config | weight | logs |
|---|---|---|---|---|---|---|
| Kinetics-722 | `eva_video_k722` | `eva_psz14` | - | config | 🤗 HF link (4.8GB) | ft_k722 |
| Kinetics-400 | `eva_video_k400` | `eva_video_k722` | 89.7 | config | 🤗 HF link (4.8GB) | ft_k400 |
| Kinetics-600 | `eva_video_k600` | `eva_video_k722` | 89.8 | config | 🤗 HF link (4.8GB) | ft_k600 |
| Kinetics-700 | `eva_video_k700` | `eva_video_k722` | 82.9 | config | 🤗 HF link (4.8GB) | ft_k700 |
All pre-trained weights can be downloaded using the following script. If problems occur with the automatic download, please follow the instructions for a manual download within the script.
```bash
sh scripts/download_checkpoints.sh
```
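If the script fails, the checkpoints can also be fetched programmatically. A minimal sketch using `huggingface_hub`: the repo id `BAAI/EVA` and the filenames `eva_psz14.pt` / `eva_video_k722.pth` match the links in this README, while the remaining video checkpoint filenames are assumptions to be checked against the HF repo.

```python
# Sketch: manual checkpoint download via huggingface_hub (pip install huggingface_hub).
# BAAI/EVA, eva_psz14.pt and eva_video_k722.pth come from the links in this README;
# the other filenames are assumptions -- verify them on the Hugging Face repo page.
from huggingface_hub import hf_hub_download

for filename in ["eva_psz14.pt", "eva_video_k722.pth", "eva_video_k400.pth"]:
    path = hf_hub_download(repo_id="BAAI/EVA", filename=filename, local_dir="pretrained")
    print("downloaded:", path)
```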
To set up the environment, run the following:
```bash
conda create -n evavideo python=3.7
conda activate evavideo
pip install -r requirements.txt
```
Install PyTorch built against the matching CUDA version from conda:

```bash
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
```
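A quick sanity check that the CUDA build of PyTorch is active (nothing project-specific is assumed here):

```python
# Confirms the installed PyTorch version, its CUDA toolkit version, and GPU visibility.
import torch

print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
```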
Install Apex as follows:

```bash
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
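A quick import check after the build; this only confirms the Apex Python package is importable, not that every CUDA extension compiled:

```python
# Quick check that the Apex Python package imports cleanly after the build.
from apex import amp  # mixed-precision module shipped with Apex

print("Apex imported OK")
```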
We have successfully fine-tuned EVA on our merged Kinetics-722 and on Kinetics-400/600/700 with this codebase.
To make video decoding faster, we use decord to decode the videos on the fly.
Please refer to the official website and/or the official script to prepare the videos.
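For reference, this is roughly how decord decodes frames on the fly. A minimal sketch, assuming a local `.mp4` path; the file name and the uniform 16-frame sampling are illustrative, not the exact strategy used by this codebase.

```python
# Sketch: on-the-fly frame decoding with decord (pip install decord).
# "some_video.mp4" is a placeholder; the sampling below is only illustrative.
import numpy as np
from decord import VideoReader, cpu

vr = VideoReader("some_video.mp4", ctx=cpu(0))              # lazily opens the video
indices = np.linspace(0, len(vr) - 1, num=16).astype(int)   # 16 evenly spaced frame indices
frames = vr.get_batch(indices).asnumpy()                    # (16, H, W, 3) uint8 array
print(frames.shape)
```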
Symlink the downloaded datasets:

```bash
ln -s /path_to_Kinetics-400_dataset data/k400
ln -s /path_to_Kinetics-600_dataset data/k600
ln -s /path_to_Kinetics-700_dataset data/k700
```
The folder structure should look like this:
```
video
├── ...
├── data
│   ├── k400/600/700 -> path_to_Kinetics-400/600/700
│   │   ├── train
│   │   │   ├── ${CLASS_NAME}/${VIDEO_ID}
│   │   ├── val
│   │   │   ├── ${CLASS_NAME}/${VIDEO_ID}
│   ├── k400/600/700/722_train.txt
│   ├── k400/600/700/722_val.txt
│   ├── k722_to_k400/600/700_mapping.npy
├── ...
```
We provide a convenient script to generate the annotation file lists. Please follow `notebooks/build_file_list.ipynb` to generate the file lists for the downloaded videos.
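For orientation, a minimal sketch of building such a list with plain Python; the `<video path> <class id>` line format and the use of the label csv for class ids are assumptions here, and `notebooks/build_file_list.ipynb` remains the authoritative reference.

```python
# Sketch: build a "<video path> <class id>" training list from the folder layout above.
# ASSUMPTIONS: one line per video, space-separated path and integer label, ids taken
# from labels/kinetics722_labels.csv -- check the notebook for the real format.
import csv
import os

with open("labels/kinetics722_labels.csv") as f:
    name_to_id = {row["name"]: int(row["id"]) for row in csv.DictReader(f)}

root = "data/k400/train"                      # repeat for k600 / k700
with open("data/k400_train.txt", "w") as out:
    for class_name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, class_name)
        for video in sorted(os.listdir(class_dir)):
            out.write(f"{os.path.join(class_dir, video)} {name_to_id[class_name]}\n")
```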
The merged dataset, coined Kinetics-722 (K-722), integrates all valid training samples from Kinetics-400 (K-400), Kinetics-600 (K-600), and Kinetics-700 (K-700). Notably, for a fair and legal comparison, we removed videos leaked into any validation set and de-duplicated the training sets based on each video's YouTube ID. The cleaned K-722 thus contains 0.63M training videos covering 722 human action classes. We also provide our data lists.
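The `k722_to_k400/600/700_mapping.npy` files relate the K-722 label space to the per-dataset label spaces. A minimal sketch for inspecting one of them; the exact structure of the stored object (e.g. a dict of class-id mappings) is an assumption and should be checked before use.

```python
# Sketch: inspect the K-722 -> K-400 class mapping file.
# ASSUMPTION: the .npy file stores a pickled Python object (e.g. a dict of class ids).
import numpy as np

mapping = np.load("data/k722_to_k400_mapping.npy", allow_pickle=True)
obj = mapping.item() if mapping.shape == () else mapping   # unwrap a 0-d object array
print(type(obj))
```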
Now, you can train and test EVA on video data.
Note:
Since our method is built upon X-CLIP, it needs a textual description for each video category. For example, we provide the text descriptions of Kinetics-722 in `labels/kinetics722_labels.csv`. Here is the format:
```
$ head -n 5 labels/kinetics722_labels.csv
id,name
0,abseiling
1,acting in play
2,adjusting glasses
3,air drumming
```
The `id` column indicates the class id, while `name` is the text description.
Note that we disabled the text branch.
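A minimal sketch for loading this label file with the standard library (no assumptions beyond the csv shown above):

```python
# Sketch: load the class-id -> text-description mapping shown above.
import csv

with open("labels/kinetics722_labels.csv") as f:
    id_to_name = {int(row["id"]): row["name"] for row in csv.DictReader(f)}

print(len(id_to_name), id_to_name[0])   # 722 classes; id 0 -> "abseiling"
```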
To evaluate EVA with 16 frames on Kinetics-400 using a single node with 8 GPUs:

- multi-view evaluation

```bash
VIDEO_CONFIG=configs/kinetics400_ft.yaml
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=$NNODES --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR --master_port=12355 main.py -cfg ${VIDEO_CONFIG} --only_test --resume pretrained/eva_video_k400.pth \
    --output /path/to/output --opts TEST.NUM_CLIP 4 TEST.NUM_CROP 3

# expected results
# top-1 accuracy: 89.7
```
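For context, `TEST.NUM_CLIP 4 TEST.NUM_CROP 3` evaluates 4 temporal clips × 3 spatial crops = 12 views per video, and the per-view predictions are aggregated before taking the top-1 class. A minimal sketch of that aggregation, assuming per-view logits of shape `(num_views, num_classes)`; the variable names are illustrative, not this codebase's API.

```python
# Sketch: multi-view aggregation over 4 clips x 3 crops = 12 views per video.
import torch

num_views, num_classes = 4 * 3, 400
view_logits = torch.randn(num_views, num_classes)        # stand-in for real model outputs
video_scores = view_logits.softmax(dim=-1).mean(dim=0)   # average the 12 view predictions
top1 = video_scores.argmax().item()
print(top1)
```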
To evaluate EVA with 16 frames on Kinetics-600 using a single node with 8 GPUs:

- multi-view evaluation

```bash
VIDEO_CONFIG=configs/kinetics600_ft.yaml
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=$NNODES --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR --master_port=12355 main.py -cfg ${VIDEO_CONFIG} --only_test --resume pretrained/eva_video_k600.pth \
    --output /path/to/output --opts TEST.NUM_CLIP 4 TEST.NUM_CROP 3

# expected results
# top-1 accuracy: 89.8
```
To evaluate EVA with 16 frames on Kinetics-700 using a single node with 8 GPUs:

- multi-view evaluation

```bash
VIDEO_CONFIG=configs/kinetics700_ft.yaml
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=$NNODES --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR --master_port=12355 main.py -cfg ${VIDEO_CONFIG} --only_test --resume pretrained/eva_video_k700.pth \
    --output /path/to/output --opts TEST.NUM_CLIP 4 TEST.NUM_CROP 3

# expected results
# top-1 accuracy: 82.9
```
The config files are located in `configs/`.
To train EVA with 8 frames on Kinetics-722 using 16 nodes (`total_batch_size=256`):
```bash
VIDEO_CONFIG=configs/kinetics722_intermediate_ft.yaml
OUTPUT_ROOT=/path/to/video/output/
pretrained=pretrained/eva_psz14.pt  # https://huggingface.co/BAAI/EVA/blob/main/eva_psz14.pt

python -m torch.distributed.launch --nproc_per_node=8 --nnodes=$nnodes \
    --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR --master_port=12355 \
    main.py -cfg ${VIDEO_CONFIG} \
    --output ${OUTPUT_ROOT} \
    --accumulation-steps 1 \
    --opts MODEL.PRETRAINED ${pretrained}
```
For example, to train EVA with 16 frames on Kinetics-400 using 8 nodes (`total_batch_size=256`):
```bash
VIDEO_CONFIG=configs/kinetics400_ft.yaml
OUTPUT_ROOT=/path/to/video/output/
pretrained=pretrained/eva_video_k722.pth  # https://huggingface.co/BAAI/EVA/blob/main/eva_video_k722.pth

python -m torch.distributed.launch --nproc_per_node=8 --nnodes=$nnodes \
    --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR --master_port=12355 \
    main.py -cfg ${VIDEO_CONFIG} \
    --output ${OUTPUT_ROOT} \
    --accumulation-steps 4 \
    --opts MODEL.PRETRAINED ${pretrained}
```
For example, to train EVA with 16 frames on Kinetics-600 using 8 nodes (`total_batch_size=256`):
```bash
VIDEO_CONFIG=configs/kinetics600_ft.yaml
OUTPUT_ROOT=/path/to/video/output/
pretrained=pretrained/eva_video_k722.pth

python -m torch.distributed.launch --nproc_per_node=8 --nnodes=$nnodes \
    --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR --master_port=12355 \
    main.py -cfg ${VIDEO_CONFIG} \
    --output ${OUTPUT_ROOT} \
    --accumulation-steps 4 \
    --opts MODEL.PRETRAINED ${pretrained}
```
For example, to train EVA with 16 frames on Kinetics-700 using 8 nodes (`total_batch_size=256`):
```bash
VIDEO_CONFIG=configs/kinetics700_ft.yaml
OUTPUT_ROOT=/path/to/video/output/
pretrained=pretrained/eva_video_k722.pth

python -m torch.distributed.launch --nproc_per_node=8 --nnodes=$nnodes \
    --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR --master_port=12355 \
    main.py -cfg ${VIDEO_CONFIG} \
    --output ${OUTPUT_ROOT} \
    --accumulation-steps 4 \
    --opts MODEL.PRETRAINED ${pretrained}
```
Note:
- We recommend setting the total batch size to 256. If memory or the number of GPUs is limited, you can use `--accumulation-steps` to maintain the total batch size. Specifically, the effective total batch size in the Kinetics-400/600/700 examples above is 64 GPUs (8 nodes × 8 GPUs per node) × 1 (`TRAIN.BATCH_SIZE`) × 4 (`TRAIN.ACCUMULATION_STEPS`) = 256; see the sketch below.
- Please specify the data paths in the config files (`configs/*.yaml`). Alternatively, you can set them on the command line by appending `--opts DATA.ROOT /path/to/data DATA.TRAIN_FILE /path/to/train/list DATA.VAL_FILE /path/to/val/list`.
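A tiny sketch of that arithmetic, useful when adapting the launch commands to a different number of GPUs (the variable names are illustrative):

```python
# Sketch: keep the effective batch size at 256 when the GPU count changes.
nodes, gpus_per_node, per_gpu_batch = 8, 8, 1
target_total = 256
accumulation_steps = target_total // (nodes * gpus_per_node * per_gpu_batch)
assert nodes * gpus_per_node * per_gpu_batch * accumulation_steps == target_total
print(accumulation_steps)  # 4, matching --accumulation-steps in the K-400/600/700 examples
```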
Disclaimer:
- Due to differences between releases of the Kinetics datasets, one remaining source of uncertainty is the generation of the data lists in the notebooks.
EVA video action recognition is built with mmaction2, Swin, CLIP, and X-CLIP. Thanks for their wonderful work!