- Denoising Diffusion Probabilistic Models
  [paper] [blog post]
  Known as: DDPMs, diffusion models, score-based generative models, or simply autoencoders.
- 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
  [paper] [code]
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
  [ICLR 2023] [paper] [project page]
- GILL: Generating Images with Multimodal Language Models
  [NeurIPS 2023] [paper] [code]
  Captioning loss: the goal is to minimize the difference between the generated captions and the ground-truth captions provided in the training data. The most common captioning loss is cross-entropy loss (a minimal sketch is given after this list).
  Single-stage training
- BLIP-2
  [paper]
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge
  [paper]
- Robust Multimodal Learning via Representation Decoupling
  [paper]
- PandaGPT: One Model To Instruction-Follow Them All
  [TLLM 2023] [paper] [project page] [code]
  Tasks:
  (1) Image-Text Tasks: image description generation
  (2) Video-Text Tasks: writing stories inspired by videos
  (3) Audio-Text Tasks: answering questions about audio
  Data: 160k image-text instruction-following examples released by LLaVA and MiniGPT-4
  Model: ImageBind + Vicuna (a hedged projection sketch follows this list)
- Enhance the Robustness in Text-Centric Multimodal Alignments
  [paper]
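
A minimal sketch of the cross-entropy captioning loss mentioned under the GILL entry, written in PyTorch. The function and tensor names (`captioning_loss`, `logits`, `target_ids`, `pad_id`) are illustrative assumptions, not the actual GILL implementation.

```python
import torch
import torch.nn.functional as F

def captioning_loss(logits: torch.Tensor, target_ids: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Token-level cross-entropy between decoder predictions and ground-truth caption tokens.

    logits:     (batch, seq_len, vocab_size) scores from the caption decoder (teacher forcing)
    target_ids: (batch, seq_len) ground-truth caption token ids
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (batch * seq_len, vocab_size)
        target_ids.reshape(-1),               # flatten to (batch * seq_len,)
        ignore_index=pad_id,                  # padding tokens are excluded from the loss
    )

# Toy usage: batch of 2 captions, 5 tokens each, vocabulary of 100 tokens.
logits = torch.randn(2, 5, 100)
targets = torch.randint(1, 100, (2, 5))
loss = captioning_loss(logits, targets)
```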
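
A hedged sketch of the PandaGPT-style "ImageBind + Vicuna" wiring: a frozen ImageBind embedding is mapped into the LLM's token-embedding space by a small projection and prepended to the embedded text prompt. The dimensions and class name below are assumptions for illustration, not the released PandaGPT code.

```python
import torch
import torch.nn as nn

IMAGEBIND_DIM = 1024  # assumed size of the ImageBind joint embedding
LLM_DIM = 4096        # assumed hidden size of Vicuna-7B token embeddings

class ModalityProjector(nn.Module):
    """Maps a modality embedding (image / video / audio via ImageBind) to an LLM prefix token."""
    def __init__(self, in_dim: int = IMAGEBIND_DIM, out_dim: int = LLM_DIM):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, modality_emb: torch.Tensor) -> torch.Tensor:
        # (batch, in_dim) -> (batch, 1, out_dim): one prefix token prepended to the prompt
        return self.proj(modality_emb).unsqueeze(1)

# Toy usage with random stand-ins for the frozen ImageBind output and embedded instruction tokens.
modality_emb = torch.randn(2, IMAGEBIND_DIM)
text_embs = torch.randn(2, 16, LLM_DIM)
prefix = ModalityProjector()(modality_emb)         # (2, 1, LLM_DIM)
llm_input = torch.cat([prefix, text_embs], dim=1)  # sequence fed to the frozen Vicuna decoder
```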