
Dissecting-VLMs

A Soft-MoE projector layer and support for a variety of vision encoders for LLaVA-1.5.

This repository builds upon the foundational work of the LLaVA project. Special thanks to @haotian-liu for making research on multimodal models more accessible by open-sourcing the LLaVA repository.

Enhancements in This Repository

  • Soft Mixture-of-Experts (Soft-MoE) Projector Layer: A new architectural component for the vision-to-language projection, aimed at improving model performance (a sketch follows this list).
  • Integration of Additional Vision Encoders (a loading sketch appears under Code Modifications below):
    • SigLIP
    • AIMv2
    • InternViT
    • DINOv2
  • Multinode Training with SLURM: Includes SLURM scripts to facilitate multinode training and scaling.
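
To make the projector concrete, here is a minimal, hedged sketch of a Soft-MoE projector in PyTorch, following the Soft Mixture-of-Experts formulation (Puigcerver et al., 2023). All names, dimensions, and the expert MLP shape are illustrative assumptions, not this repository's actual implementation (see llava/model for that).

```python
# Illustrative Soft-MoE projector sketch; names and shapes are assumptions,
# not this repository's code.
import torch
import torch.nn as nn

class SoftMoEProjector(nn.Module):
    def __init__(self, in_dim: int, out_dim: int,
                 num_experts: int = 4, slots_per_expert: int = 1):
        super().__init__()
        self.slots_per_expert = slots_per_expert
        self.num_slots = num_experts * slots_per_expert
        # One learnable slot embedding per (expert, slot) pair.
        self.slot_embeds = nn.Parameter(
            torch.randn(self.num_slots, in_dim) * in_dim ** -0.5)
        # Each expert is a small MLP mapping vision features to the LLM width.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                          nn.Linear(out_dim, out_dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, in_dim), e.g. patch features from the vision tower.
        logits = torch.einsum("btd,sd->bts", x, self.slot_embeds)
        dispatch = logits.softmax(dim=1)  # each slot: convex mix over tokens
        combine = logits.softmax(dim=2)   # each token: convex mix over slots
        slots = torch.einsum("bts,btd->bsd", dispatch, x)  # (batch, slots, in_dim)
        # Each expert processes only its own group of slots.
        outs = []
        for i, expert in enumerate(self.experts):
            chunk = slots[:, i * self.slots_per_expert:(i + 1) * self.slots_per_expert]
            outs.append(expert(chunk))
        slot_out = torch.cat(outs, dim=1)  # (batch, slots, out_dim)
        # Dispatch expert outputs back to the token positions.
        return torch.einsum("bts,bsd->btd", combine, slot_out)
```

With, e.g., in_dim=1024 (CLIP-ViT-L/14-336 patch features) and out_dim=4096 (Vicuna-7B hidden size), this maps a (batch, 576, 1024) feature grid to (batch, 576, 4096), the same interface as LLaVA-1.5's stock two-layer MLP projector.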

Code Modifications

Key changes were made to the following modules:

  • llava/model: Updates that wire in the Soft-MoE projector and the new vision towers (see the sketch below).
  • llava/train: Modifications to the training workflow, including multinode SLURM support.
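
As a hedged illustration of what such an integration involves, the sketch below loads two of the listed encoders through Hugging Face transformers. The checkpoint names are example choices, not necessarily the ones this repository uses; InternViT and AIMv2 checkpoints have historically required trust_remote_code=True and are omitted here.

```python
# Illustrative sketch: loading alternative vision towers via Hugging Face
# transformers. Checkpoint names are examples, not this repository's defaults.
import torch
from transformers import AutoImageProcessor, Dinov2Model, SiglipVisionModel

def load_vision_tower(name: str):
    if name == "siglip":
        ckpt = "google/siglip-so400m-patch14-384"
        model = SiglipVisionModel.from_pretrained(ckpt, torch_dtype=torch.float16)
    elif name == "dinov2":
        ckpt = "facebook/dinov2-large"
        model = Dinov2Model.from_pretrained(ckpt, torch_dtype=torch.float16)
    else:
        raise ValueError(f"unknown vision tower: {name}")
    processor = AutoImageProcessor.from_pretrained(ckpt)
    model.requires_grad_(False)  # LLaVA-1.5 keeps the vision tower frozen
    return processor, model

# Patch features for the projector come from the encoder's last hidden state:
#   feats = model(pixel_values).last_hidden_state   # (batch, tokens, hidden)
# Note: DINOv2 prepends a [CLS] token at index 0, which LLaVA-style wiring
# typically drops before projection.
```

Each encoder exposes a different hidden size (e.g. 1152 for SigLIP-SO400M, 1024 for DINOv2-Large), so the projector's input dimension has to follow the chosen tower.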

Getting Started

To use the features of this repository, clone it and explore the updated llava/model and llava/train directories. Detailed instructions for multinode training with SLURM are available in the provided scripts; the sketch below shows the usual shape of the bootstrap such scripts rely on.
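
As a hedged sketch (this repository's actual scripts may differ), a SLURM-launched PyTorch job typically initializes torch.distributed from the variables that srun exports; every name below is a standard SLURM or PyTorch convention rather than something taken from this repo's code.

```python
# Minimal multinode bootstrap sketch under SLURM; assumes the sbatch script
# exports MASTER_ADDR and MASTER_PORT (e.g. derived from
# `scontrol show hostnames $SLURM_JOB_NODELIST`).
import os
import torch
import torch.distributed as dist

def init_distributed() -> int:
    rank = int(os.environ["SLURM_PROCID"])         # global rank across nodes
    world_size = int(os.environ["SLURM_NTASKS"])   # total processes in the job
    local_rank = int(os.environ["SLURM_LOCALID"])  # rank within this node
    torch.cuda.set_device(local_rank)
    # The default init_method="env://" reads MASTER_ADDR / MASTER_PORT.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    return local_rank
```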

Acknowledgments

This repository would not have been possible without the contributions of the LLaVA project and its authors. Their work lays the groundwork for further exploration and innovation in vision-language models.

For more details, refer to the LLaVA repository.
