
Hi, I'm Wenhao Chai 👋

I'm a graduate student at the University of Washington focused on computer vision research: large multimodal models, video understanding, embodied agents, and generative models. I love exploring ideas that bridge theory and real-world applications, and I believe computer vision will continue to advance.

About Me


Pinned Message

  • To junior master's and undergraduate students: if you would like to chat about life, career plans, or research ideas related to AI/ML, feel free to email me a Zoom or Google Meet invitation to schedule a meeting. I dedicate at least 30 minutes every week to such meetings, and I encourage students from underrepresented groups to reach out.
  • We are also hosting a Discord server for professors and students for daily arXiv sharing and research discussion.

Research & Projects

My research currently focuses on developing visual intelligence that understands the physical world, building on video understanding as a core perceptual mechanism. I proposed a long-short term memory framework modeled after the human memory system, enabling pre-trained video Large Multimodal Models (LMMs) to comprehend hour-long video content without additional fine-tuning. To improve efficiency, I introduced token merging to LMMs, significantly reducing the number of visual tokens with minimal performance degradation. I have also demonstrated step-by-step agent system development in Minecraft, showcasing cognitive-inspired agent capabilities in virtual environments.
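The long-short term memory idea above can be sketched roughly as follows. This is an illustrative assumption, not the papers' exact algorithm: a short-term buffer of frame features is periodically consolidated into a fixed-capacity long-term memory by repeatedly averaging the most similar pair of adjacent frames. Function names and shapes here are hypothetical.

```python
import numpy as np

def merge_most_similar_adjacent(frames):
    """Average the two adjacent frame features with the highest cosine
    similarity, shrinking the memory by one slot."""
    f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    sims = (f[:-1] * f[1:]).sum(axis=1)       # cosine sim of each adjacent pair
    i = int(sims.argmax())
    merged = (frames[i] + frames[i + 1]) / 2  # simple average of the pair
    return np.concatenate([frames[:i], merged[None], frames[i + 2:]], axis=0)

def consolidate(long_mem, short_mem, long_capacity):
    """Append the short-term buffer to long-term memory, then compress
    until the long-term memory fits its capacity again."""
    mem = np.concatenate([long_mem, short_mem], axis=0)
    while mem.shape[0] > long_capacity:
        mem = merge_most_similar_adjacent(mem)
    return mem
```

Because near-duplicate consecutive frames collapse first, static segments compress aggressively while scene changes survive, which is what lets a fixed-size memory cover hour-long video.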

I have just begun shaping my research narrative around visual intelligence. If, as predicted, "pretraining as we know it will end", I believe the future of artificial intelligence lies in aligning with human intelligence, and ultimately surpassing it. I truly believe computer vision will continue to advance.

Here are some of my recent key projects and publications:

  • (ICLR 2025) AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
    Efficient and performant video detailed captioning, plus a new benchmark.
    Project Page | Paper | Code

  • (CVPR 2024) MovieChat: From Dense Token to Sparse Memory in Long Video Understanding
    From dense token to sparse memory: advancing long video understanding with state-of-the-art memory mechanisms.
    Project Page | Paper | Code

  • (ICCV 2023) StableVideo: Text-driven Consistency-aware Diffusion Video Editing
    Text-driven, consistency-aware diffusion video editing for generating coherent video content.
    Project Page | Paper | Code
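The token merging mentioned above, used to cut the number of visual tokens an LMM must process, can be sketched as a bipartite soft matching pass in the spirit of ToMe. The code below is a simplified illustration under that assumption, not any paper's actual implementation; the function name and the unweighted averaging are simplifications.

```python
import numpy as np

def bipartite_token_merge(tokens, r):
    """Reduce `tokens` (n, d) to (n - r, d) by merging the r most similar
    cross-set pairs. Real implementations track token sizes and use
    size-weighted averages; a plain average is used here for clarity."""
    a, b = tokens[::2], tokens[1::2]      # alternate tokens into two sets
    an = a / np.linalg.norm(a, axis=1, keepdims=True)
    bn = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = an @ bn.T                       # cosine similarity, A x B
    best = sim.argmax(axis=1)             # best B-match for each A-token
    score = sim.max(axis=1)
    order = np.argsort(-score)            # most mergeable A-tokens first
    merged, kept = order[:r], order[r:]
    out_b = b.copy()
    for i in merged:                      # fold each merged A-token into B
        out_b[best[i]] = (out_b[best[i]] + a[i]) / 2
    return np.concatenate([a[kept], out_b], axis=0)
```

Because redundant tokens (e.g. patches of a static background across frames) merge first, the token count drops sharply with minimal loss of information.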

For a full list of my publications and detailed project descriptions, please visit my website.

Connect with Me


This README is inspired by my personal website – rese1f.github.io. Feel free to explore for more details about my research and projects.

[GitHub contribution grid snake animation]

Pinned Repositories

  1. yangchris11/samurai – Official repository of "SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory" (Python, 6.5k stars, 416 forks)

  2. StableVideo – [ICCV 2023] StableVideo: Text-driven Consistency-aware Diffusion Video Editing (Python, 1.4k stars, 89 forks)

  3. MovieChat – [CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding (Python, 588 stars, 42 forks)

  4. Awesome-VQVAE – A collection of resources and papers on the Vector Quantized Variational Autoencoder (VQ-VAE) and its applications (251 stars, 9 forks)

  5. aurora – [ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark (Python, 75 stars, 4 forks)

  6. PoseDA – [ICCV 2023] Global Adaptation Meets Local Generalization: Unsupervised Domain Adaptation for 3D Human Pose Estimation (22 stars, 1 fork)