This repository builds upon the foundational work of the LLaVA project. Special thanks to @haotian-liu for making research on multimodal models more accessible by open-sourcing the LLaVA repository.
- Soft Mixture-of-Experts Projector Layer: a new architectural component for improved model performance (a sketch of the idea follows this list).
- Integration of additional vision encoders:
  - SigLIP
  - AIMv2
  - InternViT
  - DINOv2
- Multinode Training with SLURM: Includes SLURM scripts to facilitate multinode training and scaling.
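
In a LLaVA-style model, the projector maps vision-encoder patch features into the language model's embedding space. A soft mixture-of-experts projector replaces the single MLP with several expert MLPs whose outputs are blended per token by learned gating weights. The snippet below is a minimal sketch of that general idea, not this repository's actual implementation; the class name `SoftMoEProjector`, the `num_experts` parameter, and the dimensions are all illustrative assumptions.

```python
# Minimal sketch of a soft mixture-of-experts projector (assumed design,
# not necessarily the code in llava/model).
import torch
import torch.nn as nn


class SoftMoEProjector(nn.Module):
    """Projects vision features to the LLM hidden size via soft expert mixing."""

    def __init__(self, vision_dim: int, llm_dim: int, num_experts: int = 4):
        super().__init__()
        # Each expert is a small two-layer MLP, mirroring the standard LLaVA projector.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )
            for _ in range(num_experts)
        )
        # Gating network producing per-token mixing weights over the experts.
        self.gate = nn.Linear(vision_dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, vision_dim)
        weights = self.gate(x).softmax(dim=-1)                          # (B, N, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-2)  # (B, N, E, llm_dim)
        # Soft mixing: every token uses every expert, weighted by the gate.
        return (weights.unsqueeze(-1) * expert_out).sum(dim=-2)


# Example: project patch tokens from a 1152-dim encoder into a 4096-dim LLM space.
proj = SoftMoEProjector(vision_dim=1152, llm_dim=4096, num_experts=4)
out = proj(torch.randn(2, 576, 1152))
print(out.shape)  # torch.Size([2, 576, 4096])
```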
Key changes were made to the following modules:
- `llava/model`: updates to support the new features and integrations.
- `llava/train`: modifications for advanced training workflows.
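
As an illustration of what an encoder integration involves, the snippet below shows how a SigLIP vision tower can be loaded and run with Hugging Face `transformers` to produce the patch features that a projector consumes. This is a generic usage sketch, not the wiring inside `llava/model`; the checkpoint name and feature-extraction details are assumptions.

```python
# Sketch: obtaining patch features from a SigLIP vision tower via Hugging Face
# transformers. How this repository wires the encoder in llava/model may differ.
import torch
from PIL import Image
from transformers import SiglipImageProcessor, SiglipVisionModel

checkpoint = "google/siglip-so400m-patch14-384"  # assumed checkpoint name
processor = SiglipImageProcessor.from_pretrained(checkpoint)
vision_tower = SiglipVisionModel.from_pretrained(checkpoint).eval()

image = Image.new("RGB", (384, 384))  # stand-in for a real input image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = vision_tower(**inputs)

# Patch-level hidden states feed the projector; shape (1, num_patches, hidden_dim).
patch_features = outputs.last_hidden_state
print(patch_features.shape)
```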
To use the features of this repository, clone it and explore the updated `llava/model` and `llava/train` directories. Detailed instructions for multinode training with SLURM are available in the provided scripts.
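
For context on how SLURM-launched multinode training typically bootstraps, the sketch below shows a common pattern for initializing `torch.distributed` from the environment variables that `srun` sets for each task. It is a generic pattern under stated assumptions, not necessarily what the provided scripts do.

```python
# Sketch: initializing torch.distributed from SLURM-provided environment
# variables inside a training entry point. Generic pattern; the repository's
# scripts may configure this differently.
import os

import torch
import torch.distributed as dist


def init_distributed_from_slurm() -> int:
    # SLURM_PROCID / SLURM_NTASKS / SLURM_LOCALID are set by srun per task.
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    local_rank = int(os.environ["SLURM_LOCALID"])

    # The launch script must export MASTER_ADDR and MASTER_PORT (typically the
    # first node in the job's node list) for the default env:// rendezvous.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    return local_rank


if __name__ == "__main__":
    local_rank = init_distributed_from_slurm()
    print(f"rank {dist.get_rank()} / {dist.get_world_size()} on GPU {local_rank}")
```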
This repository would not have been possible without the contributions of the LLaVA project and its authors. Their work lays the groundwork for further exploration and innovation in vision-language models.
For more details, refer to the LLaVA repository.