Awesome Video Understanding

Understanding Video: Perceiving dynamic actions could be a huge advance in how software makes sense of the world. (from MIT Technology Review, December 6, 2017)

A list of resources for video understanding. Most of the papers can be found by searching scholar.google.com.

This list was last updated on December 13, 2017.

Video Classification
Action Recognition
Video Captioning: will be updated
Temporal Action Detection: will be updated
Video Datasets

Papers

Video Classification

Image-based methods
- Zha S, Luisier F, Andrews W, et al. Exploiting Image-trained CNN Architectures for Unconstrained Video Classification[J]. Computer Science, 2015.
- Sánchez J, Perronnin F, Mensink T, et al. Image Classification with the Fisher Vector: Theory and Practice[J]. International Journal of Computer Vision, 2013, 105: 222-245.
CNN-based methods
- Karpathy A, Toderici G, Shetty S, et al. Large-scale video classification with convolutional neural networks[C]//Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2014: 1725-1732.
- Tran D, Bourdev L D, Fergus R, et al. C3D: generic features for video analysis[J]. arXiv preprint arXiv:1412.0767, 2014.
- Fernando B, Gould S. Learning end-to-end video classification with rank-pooling[C]//International Conference on Machine Learning. 2016: 1187-1196.
RNN-based methods
- Wu Z, Wang X, Jiang Y G, et al. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification[C]//Proceedings of the 23rd ACM international conference on Multimedia. ACM, 2015: 461-470.
- Yue-Hei Ng J, Hausknecht M, Vijayanarasimhan S, et al. Beyond short snippets: Deep networks for video classification[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 4694-4702.

Action Recognition

CNN-based methods
- Ji S, Xu W, Yang M, et al. 3D Convolutional Neural Networks for Human Action Recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1): 221-231.
- Tran D, Bourdev L D, Fergus R, et al. C3D: generic features for video analysis[J]. arXiv preprint arXiv:1412.0767, 2014.
- Varol G, Laptev I, Schmid C. Long-term temporal convolutions for action recognition[J]. arXiv preprint arXiv:1604.04494, 2016.
- Sun L, Jia K, Yeung D Y, et al. Human action recognition using factorized spatio-temporal convolutional networks[C]//Proceedings of the IEEE International Conference on Computer Vision. 2015: 4597-4605.
- Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos[C]//Advances in neural information processing systems. 2014: 568-576.
- Ye H, Wu Z, Zhao R W, et al. Evaluating two-stream CNN for video classification[C]//Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. ACM, 2015: 435-442.
- Wang L, Qiao Y, Tang X. Action recognition with trajectory-pooled deep-convolutional descriptors[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 4305-4314.
- Feichtenhofer C, Pinz A, Zisserman A. Convolutional two-stream network fusion for video action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 1933-1941.
- Wang L, Xiong Y, Wang Z, et al. Temporal segment networks: Towards good practices for deep action recognition[C]//European Conference on Computer Vision. Springer International Publishing, 2016: 20-36.
- Zhang B, Wang L, Wang Z, et al. Real-time action recognition with enhanced motion vector CNNs[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 2718-2726.
- Wang X, Farhadi A, Gupta A. Actions ~ transformations[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 2658-2667.
- Zhu W, Hu J, Sun G, et al. A key volume mining deep framework for action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 1991-1999.
- Bilen H, Fernando B, Gavves E, et al. Dynamic image networks for action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 3034-3042.
- Fernando B, Anderson P, Hutter M, et al. Discriminative hierarchical rank pooling for activity recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 1924-1932.
- Cherian A, Fernando B, Harandi M, et al. Generalized rank pooling for activity recognition[J]. arXiv preprint arXiv:1704.02112, 2017.
- Fernando B, Gavves E, Oramas J, et al. Rank pooling for action recognition[J]. IEEE transactions on pattern analysis and machine intelligence, 2017, 39(4): 773-787.
- Fernando B, Gould S. Discriminatively Learned Hierarchical Rank Pooling Networks[J]. arXiv preprint arXiv:1705.10420, 2017.
RNN-based methods
- Baccouche M, Mamalet F, Wolf C, et al. Sequential deep learning for human action recognition[C]//International Workshop on Human Behavior Understanding. Springer, Berlin, Heidelberg, 2011: 29-39.
- Donahue J, Anne Hendricks L, Guadarrama S, et al. Long-term recurrent convolutional networks for visual recognition and description[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 2625-2634.
- Veeriah V, Zhuang N, Qi G J. Differential recurrent neural networks for action recognition[C]//Proceedings of the IEEE International Conference on Computer Vision. 2015: 4041-4049.
- Li Q, Qiu Z, Yao T, et al. Action recognition by learning deep multi-granular spatio-temporal video representation[C]//Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval. ACM, 2016: 159-166.
- Wu Z, Jiang Y G, Wang X, et al. Multi-stream multi-class fusion of deep networks for video classification[C]//Proceedings of the 2016 ACM on Multimedia Conference. ACM, 2016: 791-800.
- Sharma S, Kiros R, Salakhutdinov R. Action recognition using visual attention[J]. arXiv preprint arXiv:1511.04119, 2015.
- Li Z, Gavves E, Jain M, et al. VideoLSTM convolves, attends and flows for action recognition[J]. arXiv preprint arXiv:1607.01794, 2016.
Unsupervised learning methods
- Taylor G W, Fergus R, LeCun Y, et al. Convolutional learning of spatio-temporal features[C]//European conference on computer vision. Springer, Berlin, Heidelberg, 2010: 140-153.
- Le Q V, Zou W Y, Yeung S Y, et al. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis[C]//Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011: 3361-3368.
- Yan X, Chang H, Shan S, et al. Modeling video dynamics with deep dynencoder[C]//European Conference on Computer Vision. Springer, Cham, 2014: 215-230.
- Srivastava N, Mansimov E, Salakhudinov R. Unsupervised learning of video representations using LSTMs[C]//International Conference on Machine Learning. 2015: 843-852.
- Pan Y, Li Y, Yao T, et al. Learning Deep Intrinsic Video Representation by Exploring Temporal Coherence and Graph Structure[C]//IJCAI. 2016: 3832-3838.
- Ballas N, Yao L, Pal C, et al. Delving deeper into convolutional networks for learning video representations[J]. arXiv preprint arXiv:1511.06432, 2015.

Video Datasets

HMDB51
- Kuehne H, Jhuang H, Garrote E, et al. HMDB: a large video database for human motion recognition[C]//Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011: 2556-2563.
- state-of-the-art: 75%
  - Lan Z, Zhu Y, Hauptmann A G. Deep Local Video Feature for Action Recognition[J]. arXiv preprint arXiv:1701.07368, 2017.
UCF-101
- Soomro K, Zamir A R, Shah M. UCF101: A dataset of 101 human actions classes from videos in the wild[J]. arXiv preprint arXiv:1212.0402, 2012.
- state-of-the-art: 95.6%
  - Diba A, Sharma V, Van Gool L. Deep temporal linear encoding networks[J]. arXiv preprint arXiv:1611.06678, 2016.
ActivityNet
- Caba Heilbron F, Escorcia V, Ghanem B, et al. Activitynet: A large-scale video benchmark for human activity understanding[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 961-970.
- state-of-the-art: 91.3%
  - Wang L, Xiong Y, Lin D, et al. UntrimmedNets for Weakly Supervised Action Recognition and Detection[J]. arXiv preprint arXiv:1703.03329, 2017.
Sports-1M
- Karpathy A, Toderici G, Shetty S, et al. Large-scale video classification with convolutional neural networks[C]//Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2014: 1725-1732.
- state-of-the-art: 67.6%
  - Abu-El-Haija S, Kothari N, Lee J, et al. YouTube-8M: A large-scale video classification benchmark[J]. arXiv preprint arXiv:1609.08675, 2016.
YouTube-8M
- Abu-El-Haija S, Kothari N, Lee J, et al. YouTube-8M: A large-scale video classification benchmark[J]. arXiv preprint arXiv:1609.08675, 2016.
- state-of-the-art: 84.967%
  - Miech A, Laptev I, Sivic J. Learnable pooling with Context Gating for video classification[J]. arXiv preprint arXiv:1706.06905, 2017.
Kinetics
- Kay W, Carreira J, Simonyan K, et al. The Kinetics Human Action Video Dataset[J]. arXiv preprint arXiv:1705.06950, 2017.
- state-of-the-art: ?
Moments in Time Dataset
- Mathew Monfort, Bolei Zhou, Sarah Adel Bargal, Tom Yan, Alex Andonian, Kandan Ramakrishnan, Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl Vondrick, Aude Oliva. Moments in Time Dataset: one million videos for event understanding. Tech report.
- state-of-the-art: ?
