- Denoising Diffusion Probabilistic Models
  [paper] [blog post]
  Known as: DDPMs, diffusion models, score-based generative models, or simply autoencoders.
- 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
  [paper] [code]
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
  [ICLR 2023] [paper] [project page]
- GILL: Generating Images with Multimodal Language Models
  [NeurIPS 2023] [paper] [code]
  Captioning loss: the goal is to minimize the difference between the generated captions and the ground-truth captions provided in the training data. The most common captioning loss is cross-entropy loss (a minimal sketch is given after this list).
  Single-stage training
- BLIP-2
  [paper]
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge
  [paper]
- Robust Multimodal Learning via Representation Decoupling
  [paper]
- PandaGPT: One Model To Instruction-Follow Them All
  [TLLM 2023] [paper] [project page] [code]
  Tasks:
  (1) Image-Text Tasks: image description generation
  (2) Video-Text Tasks: writing stories inspired by videos
  (3) Audio-Text Tasks: answering questions about audio
  Data: 160k image-text instruction-following examples released by LLaVA and MiniGPT-4
  Model: ImageBind + Vicuna (a hedged projection sketch follows this list)
- Enhance the Robustness in Text-Centric Multimodal Alignments
  [paper]
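
A minimal sketch of the cross-entropy captioning loss mentioned under the GILL entry, written in PyTorch. The function and tensor names (`captioning_loss`, `logits`, `target_ids`, `pad_id`) are illustrative assumptions, not the actual GILL implementation.

```python
import torch
import torch.nn.functional as F

def captioning_loss(logits: torch.Tensor, target_ids: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Token-level cross-entropy between decoder predictions and ground-truth caption tokens.

    logits:     (batch, seq_len, vocab_size) scores from the caption decoder (teacher forcing)
    target_ids: (batch, seq_len) ground-truth caption token ids
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (batch * seq_len, vocab_size)
        target_ids.reshape(-1),               # flatten to (batch * seq_len,)
        ignore_index=pad_id,                  # padding tokens are excluded from the loss
    )

# Toy usage: batch of 2 captions, 5 tokens each, vocabulary of 100 tokens.
logits = torch.randn(2, 5, 100)
targets = torch.randint(1, 100, (2, 5))
loss = captioning_loss(logits, targets)
```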
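
A hedged sketch of the PandaGPT-style "ImageBind + Vicuna" wiring: a frozen ImageBind embedding is mapped into the LLM's token-embedding space by a small projection and prepended to the embedded text prompt. The dimensions and class name below are assumptions for illustration, not the released PandaGPT code.

```python
import torch
import torch.nn as nn

IMAGEBIND_DIM = 1024  # assumed size of the ImageBind joint embedding
LLM_DIM = 4096        # assumed hidden size of Vicuna-7B token embeddings

class ModalityProjector(nn.Module):
    """Maps a modality embedding (image / video / audio via ImageBind) to an LLM prefix token."""
    def __init__(self, in_dim: int = IMAGEBIND_DIM, out_dim: int = LLM_DIM):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, modality_emb: torch.Tensor) -> torch.Tensor:
        # (batch, in_dim) -> (batch, 1, out_dim): one prefix token prepended to the prompt
        return self.proj(modality_emb).unsqueeze(1)

# Toy usage with random stand-ins for the frozen ImageBind output and embedded instruction tokens.
modality_emb = torch.randn(2, IMAGEBIND_DIM)
text_embs = torch.randn(2, 16, LLM_DIM)
prefix = ModalityProjector()(modality_emb)         # (2, 1, LLM_DIM)
llm_input = torch.cat([prefix, text_embs], dim=1)  # sequence fed to the frozen Vicuna decoder
```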