I'm a graduate student at the University of Washington focused on computer vision research on large multimodal models, video understanding, embodied agents, and generative models. I love exploring innovative ideas that bridge theory and real-world applications.
- 🎓 Graduate Student at University of Washington
- 📚 Former undergraduate at Zhejiang University
- 🔬 Research interests: video understanding, large multimodal models (LMMs), and generative models
- 🤝 Collaborated with leading labs and researchers, including Pika Labs, Microsoft Research Asia, the Information Processing Lab, the CVNext Lab, and more.
- To junior master's/undergraduate students: if you would like to chat about life, career plans, or research ideas related to AI/ML, feel free to send me a Zoom or Google Meet invitation via email to schedule a meeting. I dedicate at least 30 minutes every week to such meetings, and I encourage students from underrepresented groups to reach out.
- We host a Discord server for professors and students for daily arXiv paper sharing and research discussion.
My research currently focuses on developing visual intelligence that understands the physical world, building on video understanding as a core perceptual mechanism. I proposed a long-short term memory framework modeled after the human memory system, enabling pre-trained video large multimodal models (LMMs) to comprehend hour-long video content without additional fine-tuning (a simplified sketch of the idea follows below). To improve efficiency, I introduced token merging for LMMs, significantly reducing the number of visual tokens with minimal performance degradation. I also demonstrated step-by-step agent system development in Minecraft, showcasing cognitively inspired agent capabilities in virtual environments.
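The memory idea can be sketched in a few lines of Python. This is a toy illustration, not the actual MovieChat or AuroraCap code: frame embeddings stream into a fixed-size short-term buffer, and when the buffer fills, the most similar adjacent pairs are merged by averaging until the buffer is compact enough to append to long-term memory. The class name, buffer sizes, and the averaging rule are all illustrative assumptions.

```python
import numpy as np

class LongShortTermMemory:
    """Toy long-short term memory for streaming video frame embeddings.

    Frames enter a FIFO short-term buffer; when the buffer is full, the
    most similar adjacent frame pairs are merged by averaging until only
    `consolidated_size` embeddings remain, which are then appended to
    long-term memory. Illustrative only; names and sizes are assumptions.
    """

    def __init__(self, short_capacity=16, consolidated_size=4):
        self.short_capacity = short_capacity
        self.consolidated_size = consolidated_size
        self.short_term = []   # recent frame embeddings, FIFO order
        self.long_term = []    # compacted history of the whole video

    def add_frame(self, embedding):
        """Ingest one frame embedding (a 1-D numpy array)."""
        self.short_term.append(np.asarray(embedding, dtype=np.float64))
        if len(self.short_term) >= self.short_capacity:
            self._consolidate()

    def _consolidate(self):
        """Greedily merge the most similar adjacent pair until compact."""
        frames = self.short_term
        while len(frames) > self.consolidated_size:
            sims = [self._cosine(frames[i], frames[i + 1])
                    for i in range(len(frames) - 1)]
            i = int(np.argmax(sims))                  # most redundant pair
            merged = (frames[i] + frames[i + 1]) / 2  # average the pair
            frames = frames[:i] + [merged] + frames[i + 2:]
        self.long_term.extend(frames)
        self.short_term = []

    @staticmethod
    def _cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
```

For example, streaming 100 synthetic frame embeddings through a small buffer leaves far fewer long-term entries than frames ingested:

```python
memory = LongShortTermMemory(short_capacity=8, consolidated_size=2)
for frame in np.random.randn(100, 512):  # 100 fake frame embeddings
    memory.add_frame(frame)
print(len(memory.long_term))  # 24 compacted entries for 96 consumed frames
```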
I have just begun shaping my research narrative around visual intelligence. Given that "pretraining as we know it will end", I believe the future of artificial intelligence lies in aligning with human intelligence, and ultimately surpassing it. I truly believe computer vision will continue to advance.
Here are some of my recent key projects and publications:
- (ICLR 2025) AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
  Efficient and performant video detailed captioning, plus a new benchmark.
  Project Page | Paper | Code
- (CVPR 2024) MovieChat: From Dense Token to Sparse Memory in Long Video Understanding
  From dense token to sparse memory: advancing long video understanding with state-of-the-art memory mechanisms.
  Project Page | Paper | Code
- (ICCV 2023) StableVideo: Text-driven Consistency-aware Diffusion Video Editing
  Text-driven, consistency-aware diffusion video editing for generating coherent video content.
  Project Page | Paper | Code
For a full list of my publications and detailed project descriptions, please visit my website.
Docs & Profiles:
- 📧 Email: [email protected]
- 🐦 Twitter: @wenhaocha1
- 💼 LinkedIn: Wenhao Chai
- 🤗 Hugging Face: Profile
This README is inspired by my personal website – rese1f.github.io. Feel free to explore for more details about my research and projects.