Understanding Video: Perceiving dynamic actions could be a huge advance in how software makes sense of the world. (From MIT Technology Review, December 6, 2017.)
A list of resources for video understanding. Most of the papers can be found by searching scholar.google.com.
This list was last updated on December 13, 2017.
- Video Classification
- Action Recognition
- Video Captioning: will be updated
- Temporal Action Detection: will be updated
- Video Datasets
- image-based methods
- Zha S, Luisier F, Andrews W, et al. Exploiting Image-trained CNN Architectures for Unconstrained Video Classification[J]. Computer Science, 2015.
- Sánchez J, Perronnin F, Mensink T, et al. Image Classification with the Fisher Vector: Theory and Practice[J]. International Journal of Computer Vision, 2013, 105: 222-245.
- CNN-based methods
- Karpathy A, Toderici G, Shetty S, et al. Large-scale video classification with convolutional neural networks[C]//Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2014: 1725-1732.
- Tran D, Bourdev L D, Fergus R, et al. C3D: generic features for video analysis[J]. CoRR, abs/1412.0767, 2014, 2(7): 8.
- Fernando B, Gould S. Learning end-to-end video classification with rank-pooling[C]//International Conference on Machine Learning. 2016: 1187-1196.
- RNN-based methods
- Wu Z, Wang X, Jiang Y G, et al. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification[C]//Proceedings of the 23rd ACM international conference on Multimedia. ACM, 2015: 461-470.
- Yue-Hei Ng J, Hausknecht M, Vijayanarasimhan S, et al. Beyond short snippets: Deep networks for video classification[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 4694-4702.
- CNN-based methods
- Ji S, Xu W, Yang M, et al. 3D Convolutional Neural Networks for Human Action Recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1): 221-231.
- Tran D, Bourdev L D, Fergus R, et al. C3D: generic features for video analysis[J]. CoRR, abs/1412.0767, 2014, 2(7): 8.
- Varol G, Laptev I, Schmid C. Long-term temporal convolutions for action recognition[J]. arXiv preprint arXiv:1604.04494, 2016.
- Sun L, Jia K, Yeung D Y, et al. Human action recognition using factorized spatio-temporal convolutional networks[C]//Proceedings of the IEEE International Conference on Computer Vision. 2015: 4597-4605.
- Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos[C]//Advances in neural information processing systems. 2014: 568-576.
- Ye H, Wu Z, Zhao R W, et al. Evaluating two-stream CNN for video classification[C]//Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. ACM, 2015: 435-442.
- Wang L, Qiao Y, Tang X. Action recognition with trajectory-pooled deep-convolutional descriptors[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 4305-4314.
- Feichtenhofer C, Pinz A, Zisserman A. Convolutional two-stream network fusion for video action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 1933-1941.
- Wang L, Xiong Y, Wang Z, et al. Temporal segment networks: Towards good practices for deep action recognition[C]//European Conference on Computer Vision. Springer International Publishing, 2016: 20-36.
- Zhang B, Wang L, Wang Z, et al. Real-time action recognition with enhanced motion vector CNNs[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 2718-2726.
- Wang X, Farhadi A, Gupta A. Actions ~ Transformations[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 2658-2667.
- Zhu W, Hu J, Sun G, et al. A key volume mining deep framework for action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 1991-1999.
- Bilen H, Fernando B, Gavves E, et al. Dynamic image networks for action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 3034-3042.
- Fernando B, Anderson P, Hutter M, et al. Discriminative hierarchical rank pooling for activity recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 1924-1932.
- Cherian A, Fernando B, Harandi M, et al. Generalized rank pooling for activity recognition[J]. arXiv preprint arXiv:1704.02112, 2017.
- Fernando B, Gavves E, Oramas J, et al. Rank pooling for action recognition[J]. IEEE transactions on pattern analysis and machine intelligence, 2017, 39(4): 773-787.
- Fernando B, Gould S. Discriminatively Learned Hierarchical Rank Pooling Networks[J]. arXiv preprint arXiv:1705.10420, 2017.
- RNN-based methods
- Baccouche M, Mamalet F, Wolf C, et al. Sequential deep learning for human action recognition[C]//International Workshop on Human Behavior Understanding. Springer, Berlin, Heidelberg, 2011: 29-39.
- Donahue J, Anne Hendricks L, Guadarrama S, et al. Long-term recurrent convolutional networks for visual recognition and description[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 2625-2634.
- Veeriah V, Zhuang N, Qi G J. Differential recurrent neural networks for action recognition[C]//Proceedings of the IEEE International Conference on Computer Vision. 2015: 4041-4049.
- Li Q, Qiu Z, Yao T, et al. Action recognition by learning deep multi-granular spatio-temporal video representation[C]//Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval. ACM, 2016: 159-166.
- Wu Z, Jiang Y G, Wang X, et al. Multi-stream multi-class fusion of deep networks for video classification[C]//Proceedings of the 2016 ACM on Multimedia Conference. ACM, 2016: 791-800.
- Sharma S, Kiros R, Salakhutdinov R. Action recognition using visual attention[J]. arXiv preprint arXiv:1511.04119, 2015.
- Li Z, Gavves E, Jain M, et al. VideoLSTM convolves, attends and flows for action recognition[J]. arXiv preprint arXiv:1607.01794, 2016.
- Unsupervised learning methods
- Taylor G W, Fergus R, LeCun Y, et al. Convolutional learning of spatio-temporal features[C]//European conference on computer vision. Springer, Berlin, Heidelberg, 2010: 140-153.
- Le Q V, Zou W Y, Yeung S Y, et al. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis[C]//Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011: 3361-3368.
- Yan X, Chang H, Shan S, et al. Modeling video dynamics with deep dynencoder[C]//European Conference on Computer Vision. Springer, Cham, 2014: 215-230.
- Srivastava N, Mansimov E, Salakhutdinov R. Unsupervised learning of video representations using LSTMs[C]//International Conference on Machine Learning. 2015: 843-852.
- Pan Y, Li Y, Yao T, et al. Learning Deep Intrinsic Video Representation by Exploring Temporal Coherence and Graph Structure[C]//IJCAI. 2016: 3832-3838.
- Ballas N, Yao L, Pal C, et al. Delving deeper into convolutional networks for learning video representations[J]. arXiv preprint arXiv:1511.06432, 2015.
- HMDB51
- Kuehne H, Jhuang H, Garrote E, et al. HMDB: a large video database for human motion recognition[C]//Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011: 2556-2563.
- state-of-the-art: 75%
- Lan Z, Zhu Y, Hauptmann A G. Deep Local Video Feature for Action Recognition[J]. arXiv preprint arXiv:1701.07368, 2017.
- UCF-101
- Soomro K, Zamir A R, Shah M. UCF101: A dataset of 101 human actions classes from videos in the wild[J]. arXiv preprint arXiv:1212.0402, 2012.
- state-of-the-art: 95.6%
- Diba A, Sharma V, Van Gool L. Deep temporal linear encoding networks[J]. arXiv preprint arXiv:1611.06678, 2016.
- ActivityNet
- Caba Heilbron F, Escorcia V, Ghanem B, et al. Activitynet: A large-scale video benchmark for human activity understanding[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 961-970.
- state-of-the-art: 91.3%
- Wang L, Xiong Y, Lin D, et al. UntrimmedNets for Weakly Supervised Action Recognition and Detection[J]. arXiv preprint arXiv:1703.03329, 2017.
- Sports-1M
- Karpathy A, Toderici G, Shetty S, et al. Large-scale video classification with convolutional neural networks[C]//Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2014: 1725-1732.
- state-of-the-art: 67.6%
- Abu-El-Haija S, Kothari N, Lee J, et al. YouTube-8M: A large-scale video classification benchmark[J]. arXiv preprint arXiv:1609.08675, 2016.
- YouTube-8M
- Abu-El-Haija S, Kothari N, Lee J, et al. YouTube-8M: A large-scale video classification benchmark[J]. arXiv preprint arXiv:1609.08675, 2016.
- state-of-the-art: 84.967%
- Miech A, Laptev I, Sivic J. Learnable pooling with Context Gating for video classification[J]. arXiv preprint arXiv:1706.06905, 2017.
- Kinetics
- Kay W, Carreira J, Simonyan K, et al. The Kinetics Human Action Video Dataset[J]. arXiv preprint arXiv:1705.06950, 2017.
- state-of-the-art: ?
- Moments in Time Dataset
- Mathew Monfort, Bolei Zhou, Sarah Adel Bargal, Tom Yan, Alex Andonian, Kandan Ramakrishnan, Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl Vondrick, Aude Oliva. Moments in Time Dataset: one million videos for event understanding. Tech report.
- state-of-the-art: ?