Method | Extra Data | Backbone | Epoch | #Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
---|---|---|---|---|---|---|---|---|
AMD | no | ViT-S | 800 | 16x2x3 | log/checkpoint | log/checkpoint | 70.2 | 92.5 |
AMD | no | ViT-B | 800 | 16x2x3 | log/checkpoint | log/checkpoint | 73.3 | 94.0 |

Method | Extra Data | Backbone | Epoch | #Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
---|---|---|---|---|---|---|---|---|
AMD | no | ViT-S | 800 | 16x5x3 | log/checkpoint | log/checkpoint | 80.1 | 94.5 |
AMD | no | ViT-B | 800 | 16x5x3 | log/checkpoint | log/checkpoint | 82.2 | 95.3 |

Method | Extra Data | Backbone | Epoch | Pre-train | Fine-tune | Top-1 |
---|---|---|---|---|---|---|
AMD | no | ViT-S | 800 | log/checkpoint | log/checkpoint | 82.1 |
AMD | no | ViT-B | 800 | log/checkpoint | log/checkpoint | 84.6 |

- We report the results of AMD fine-tuned with `I3D dense sampling` on Kinetics-400 and `TSN uniform sampling` on Something-Something V2, respectively.
- #Frame = #input_frame x #clip x #crop.
- #input_frame is the number of frames fed to the model per clip during the test phase.
- #crop is the number of spatial crops (e.g., 3 for left/right/center crops).
- #clip is the number of temporal clips (e.g., 5 means temporal sampling is repeated to draw five clips with different start indices).
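
To make the #Frame notation concrete, here is a minimal Python sketch that splits a value such as `16x5x3` into its three factors and derives the number of test views (#clip x #crop) and the total number of frames processed per video. The names `FrameSpec` and `parse_frame_spec` are hypothetical helpers written for this illustration; they are not part of the AMD codebase.

```python
from typing import NamedTuple


class FrameSpec(NamedTuple):
    """Hypothetical container for '#input_frame x #clip x #crop'."""

    input_frames: int  # frames fed to the model per clip
    clips: int         # temporal clips sampled per video
    crops: int         # spatial crops taken per clip

    @property
    def views(self) -> int:
        # Each (clip, crop) pair is one forward pass at test time.
        return self.clips * self.crops

    @property
    def total_frames(self) -> int:
        # Frames processed per video across all views.
        return self.input_frames * self.views


def parse_frame_spec(spec: str) -> FrameSpec:
    """Parse a string such as '16x5x3' into a FrameSpec."""
    parts = spec.lower().split("x")
    if len(parts) != 3:
        raise ValueError(f"expected '#input_frame x #clip x #crop', got {spec!r}")
    return FrameSpec(*(int(p) for p in parts))


if __name__ == "__main__":
    for spec in ("16x2x3", "16x5x3"):
        parsed = parse_frame_spec(spec)
        print(spec, "views:", parsed.views, "frames:", parsed.total_frames)
```

Under this reading, `16x5x3` corresponds to 15 views of 16 frames each (240 frames per video), and `16x2x3` to 6 views (96 frames per video).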