
# Learning Spatio-Temporal Transformer for Visual Tracking

## Abstract

In this paper, we present a new tracking architecture with an encoder-decoder transformer as the key component. The encoder models the global spatio-temporal feature dependencies between target objects and search regions, while the decoder learns a query embedding to predict the spatial positions of the target objects. Our method casts object tracking as a direct bounding box prediction problem, without using any proposals or predefined anchors. With the encoder-decoder transformer, the prediction of objects just uses a simple fully-convolutional network, which estimates the corners of objects directly. The whole method is end-to-end and does not need any postprocessing steps such as cosine window and bounding box smoothing, thus largely simplifying existing tracking pipelines. The proposed tracker achieves state-of-the-art performance on five challenging short-term and long-term benchmarks, while running at real-time speed, being 6× faster than Siam R-CNN. Code and models are open-sourced.

## Citation

```bibtex
@inproceedings{yan2021learning,
  title={Learning spatio-temporal transformer for visual tracking},
  author={Yan, Bin and Peng, Houwen and Fu, Jianlong and Wang, Dong and Lu, Huchuan},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={10448--10457},
  year={2021}
}
```

## Results and models

STARK is trained in two stages. We denote the 1st-stage model as STARK-ST1 and the 2nd-stage model as STARK-ST2. The models provided below are the last-epoch models by default.

The models from the two stages have different configurations. For example, `stark_st1_r50_500e_got10k` is the configuration of the 1st-stage model and `stark_st2_r50_50e_got10k` is the configuration of the 2nd-stage model.

Note: when training the 2nd-stage model, you must pass the extra argument `--cfg-options` with the key `load_from` on the command line so that the pretrained 1st-stage model is loaded. Here is an example:

```shell
bash ./tools/dist_train.sh \
    ${CONFIG_FILE} \
    ${GPU_NUM} \
    --cfg-options load_from=${STARK_ST1_CHECKPOINT}
```
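Here `${STARK_ST1_CHECKPOINT}` stands for the path to the trained 1st-stage model. As a concrete illustration, training the 2nd-stage GOT10k model from a 1st-stage checkpoint might look like the following; the config path and checkpoint location are assumptions about a typical directory layout, so substitute your actual paths.

```shell
# Illustrative only: the config path and checkpoint location below are
# assumptions about a typical layout; replace them with your own paths.
bash ./tools/dist_train.sh \
    configs/sot/stark/stark_st2_r50_50e_got10k.py \
    8 \
    --cfg-options load_from=work_dirs/stark_st1_r50_500e_got10k/latest.pth
```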

### LaSOT

We provide the last-epoch models with their configurations and training logs.

| Method | Backbone | Style | Lr schd | Mem (GB) | Inf time (fps) | Success | Norm precision | Precision | Config | Download |
| :-------: | :--: | :-: | :--: | :--: | :-: | :--: | :--: | :--: | :----: | :----------: |
| STARK-ST1 | R-50 | - | 500e | 8.45 | - | 67.0 | 77.3 | 71.7 | config | model \| log |
| STARK-ST2 | R-50 | - | 50e | 2.31 | - | 67.8 | 78.5 | 73.0 | config | model \| log |
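To reproduce these LaSOT numbers locally, a test command along the following lines should work, assuming the repository ships a `tools/dist_test.sh` companion to `tools/dist_train.sh` and that its test script accepts `--checkpoint` and `--eval track`; these names are assumptions, so check `python tools/test.py --help` if your version differs.

```shell
# Sketch under the assumptions above: dist_test.sh, --checkpoint and
# --eval track may be named differently in your version of the codebase.
bash ./tools/dist_test.sh \
    ${CONFIG_FILE} \
    ${GPU_NUM} \
    --checkpoint ${CHECKPOINT_FILE} \
    --eval track
```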

### TrackingNet

The results of STARK on TrackingNet are reimplemented by ourselves. The last-epoch model is evaluated by submitting its raw results to the TrackingNet evaluation server. We provide the model with its configuration and training log.

| Method | Backbone | Style | Lr schd | Mem (GB) | Inf time (fps) | Success | Norm precision | Precision | Config | Download |
| :-------: | :--: | :-: | :--: | :--: | :-: | :--: | :--: | :--: | :----: | :----------: |
| STARK-ST1 | R-50 | - | 500e | 8.45 | - | 80.3 | 85.0 | 77.7 | config | model \| log |
| STARK-ST2 | R-50 | - | 50e | 2.31 | - | 81.4 | 86.2 | 79.0 | config | model \| log |

### GOT10k

The results of STARK on GOT10k are reimplemented by ourselves. As with TrackingNet, the last-epoch model is evaluated by submitting its raw results to the GOT10k evaluation server; an example command for producing the result files is sketched after the table below. We provide the model with its configuration and training log.

| Method | Backbone | Style | Lr schd | Mem (GB) | Inf time (fps) | Success | Norm precision | Precision | Config | Download |
| :-------: | :--: | :-: | :--: | :--: | :-: | :--: | :--: | :--: | :----: | :----------: |
| STARK-ST1 | R-50 | - | 500e | 8.45 | - | 68.1 | 77.4 | 62.4 | config | model \| log |
| STARK-ST2 | R-50 | - | 50e | 2.31 | - | 68.3 | 77.6 | 62.7 | config | model \| log |
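TrackingNet and GOT10k are scored on remote evaluation servers, so instead of computing metrics locally you dump the raw tracking results to files and upload them. A minimal sketch is below; the `--format-only` flag and the `resfile_path` option are assumptions modeled on similar OpenMMLab-style test scripts, so verify the exact interface with `python tools/test.py --help` before use.

```shell
# Hypothetical sketch: --format-only and resfile_path are assumed flags;
# confirm them against tools/test.py --help in your checkout.
bash ./tools/dist_test.sh \
    ${CONFIG_FILE} \
    ${GPU_NUM} \
    --checkpoint ${CHECKPOINT_FILE} \
    --format-only \
    --eval-options resfile_path=${RESULTS_DIR}
```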