¹ ² Chenyang Lyu, ³ Minghao Wu, ¹ * Longyue Wang, ¹ Xinting Huang,
¹ Bingshuai Liu, ¹ Zefeng Du, ¹ Shuming Shi, ¹ Zhaopeng Tu
¹ Tencent AI Lab, ² Dublin City University, ³ Monash University
*Longyue Wang is the corresponding author: [email protected]
Macaw-LLM is an exploratory endeavor that pioneers multi-modal language modeling by seamlessly combining image🖼️, video📹, audio🎵, and text📝 data, built upon the foundations of CLIP, Whisper, and LLaMA.
- Introduction
- Key Features
- Architecture
- Alignment Strategy
- Installation
- Usage
- Future Work and Contributions
In recent years, the field of language modeling has witnessed remarkable advancements. However, integrating multiple modalities, such as images, video, audio, and text, has remained challenging. Macaw-LLM brings together state-of-the-art models for processing visual, auditory, and textual information, namely CLIP, Whisper, and LLaMA.
Macaw-LLM boasts the following unique features:
- Simple & Fast Alignment: Macaw-LLM enables seamless integration of multi-modal data through simple and fast alignment to LLM embeddings. This efficient process ensures quick adaptation of diverse data types.
- One-Stage Instruction Fine-Tuning: Our model streamlines the adaptation process through one-stage instruction fine-tuning, promoting a more efficient learning experience.
- New Multi-modal Instruction Dataset: We create a new multi-modal instruction dataset that covers diverse instructional tasks leveraging image and video modalities, which facilitates future work on multi-modal LLMs.
Macaw-LLM is composed of three main components:
- CLIP: Responsible for encoding images and video frames.
- Whisper: Responsible for encoding audio data.
- LLM (LLaMA/Vicuna/Bloom): The language model that encodes instructions and generates responses.
The integration of these models allows Macaw-LLM to process and analyze multi-modal data effectively.
Our novel alignment strategy enables faster adaptation by efficiently bridging multi-modal features to textual features. The process involves:
- Encoding multi-modal features with CLIP and Whisper.
- Feeding the encoded features into an attention function, wherein the multi-modal features serve as the query and the embedding matrix of LLaMA as the key and value.
- Injecting the outputs into the input sequence (before instruction tokens) of LLaMA, allowing for a streamlined alignment process with minimal additional parameters.
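The three steps above can be sketched in plain Python. This is a minimal illustration, not the repository's implementation. It assumes a single-head scaled dot-product attention and a learned query projection `w_q` that maps encoder features to the LLM embedding width; the text above only says "an attention function", so both are assumptions. The function names and the toy matrices are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def align_to_text_space(mm_features, w_q, embed_matrix):
    """Attend from multi-modal features (query) to the LLM embedding matrix (key and value).

    mm_features:  (n, d_mm) features from CLIP or Whisper
    w_q:          (d_mm, d_model) query projection (assumed, to match dimensions)
    embed_matrix: (vocab, d_model) token embedding matrix of the LLM
    """
    d_model = len(embed_matrix[0])
    queries = matmul(mm_features, w_q)                    # (n, d_model)
    keys_t = [list(col) for col in zip(*embed_matrix)]    # (d_model, vocab)
    scores = matmul(queries, keys_t)                      # (n, vocab)
    weights = [softmax([s / math.sqrt(d_model) for s in row]) for row in scores]
    return matmul(weights, embed_matrix)                  # (n, d_model)


def build_input_sequence(aligned, instruction_embeds):
    """Place the aligned multi-modal vectors before the instruction token embeddings."""
    return aligned + instruction_embeds


# Toy example: 2 encoded features (d_mm=3), vocabulary of 4 tokens, d_model=2.
mm_features = [[1.0, 0.0, 2.0], [0.5, 1.5, -1.0]]
w_q = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.4]]
embed_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
instruction_embeds = [embed_matrix[2], embed_matrix[0]]

aligned = align_to_text_space(mm_features, w_q, embed_matrix)
sequence = build_input_sequence(aligned, instruction_embeds)
print(len(sequence), len(sequence[0]))  # 4 2
```

Because the value matrix is the embedding matrix itself, each aligned vector is a weighted average of existing token embeddings, so it already lies in the space the LLM reads. In this sketch the only new parameters are those of `w_q`, which is consistent with the "minimal additional parameters" claim above.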
To install Macaw-LLM, follow these steps:
# Clone the repository
git clone https://github.com/lyuchenyang/Macaw-LLM.git
# Change to the Macaw-LLM directory
cd Macaw-LLM
# Install required packages
pip install -r requirements.txt
# Install ffmpeg
yum install ffmpeg -y
# Install apex
git clone https://github.com/NVIDIA/apex.git
cd apex
python setup.py install
cd ..
- Downloading dataset:
  - Text data: stanford_alpaca/alpaca_data.json
  - Image data: COCO Dataset, VQA Dataset
  - Video data: Charades and Video Dialog
  - Image instruction data: Macaw-LLM image instruction dataset
  - Video instruction data: Macaw-LLM video instruction dataset
- Dataset preprocessing:
  - Place the data for the three modalities in their respective folders: data/text/, data/image/, data/video/
  - Extract frames and audio from videos:
    python preprocess_data.py
  - Transform supervised data to dataset:
    python preprocess_data_supervised.py
  - Transform unsupervised data to dataset:
    python preprocess_data_unsupervised.py
- Training:
  - Execute the training script (you can specify the training parameters inside):
    ./train.sh
- Inference:
  - Execute the inference script (you can give any customized inputs inside):
    ./inference.sh
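For the preprocessing step that extracts frames and audio from videos, the sketch below shows one way to build the corresponding ffmpeg commands. It is an illustration only and does not reproduce what preprocess_data.py does. The helper names, the example file name, the output folders, the one-frame-per-second rate, and the 16 kHz mono audio format (the input format Whisper expects) are all assumptions.

```python
from pathlib import Path


def frame_extraction_cmd(video, out_dir, fps=1):
    """Build an ffmpeg command that samples `fps` frames per second as JPEG files."""
    pattern = str(Path(out_dir) / "frame_%04d.jpg")
    return ["ffmpeg", "-i", str(video), "-vf", f"fps={fps}", pattern]


def audio_extraction_cmd(video, out_wav, sample_rate=16000):
    """Build an ffmpeg command that drops the video stream and writes mono audio."""
    return [
        "ffmpeg", "-i", str(video),
        "-vn",                      # no video in the output
        "-ac", "1",                 # one audio channel
        "-ar", str(sample_rate),    # audio sample rate in Hz
        str(out_wav),
    ]


# Hypothetical input file under the data/video/ folder mentioned above.
video = Path("data/video") / "example.mp4"
frames_cmd = frame_extraction_cmd(video, Path("data/video") / "frames")
audio_cmd = audio_extraction_cmd(video, Path("data/video") / "example.wav")

# To actually run a command (requires ffmpeg on PATH and an existing output folder):
#   import subprocess
#   subprocess.run(frames_cmd, check=True)
print(" ".join(frames_cmd))
print(" ".join(audio_cmd))
```

Building the commands as argument lists rather than shell strings avoids quoting problems with file names that contain spaces.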
We present several examples that highlight Macaw-LLM's ability to understand and follow multi-modal instructions. They show the system interpreting the content of images and videos and producing fluent, contextually relevant answers to questions about that content in natural-language conversation.
While our model is still in its early stages, we believe that Macaw-LLM paves the way for future research in the realm of multi-modal language modeling. The integration of diverse data modalities holds immense potential for pushing the boundaries of artificial intelligence and enhancing our understanding of complex real-world scenarios. By introducing Macaw-LLM, we hope to inspire further exploration and innovation in this exciting area of study.
We welcome contributions from the community to improve and expand Macaw-LLM's capabilities. 🤝
- Evaluation: We show some examples of Macaw-LLM's multi-modal ability. However, we acknowledge that these examples may not be adequate to demonstrate the model's capabilities accurately and comprehensively. We aim to conduct an extensive evaluation of the system.
- More Language Models: We aim to extend Macaw-LLM by incorporating additional language models such as Dolly, BLOOM, and T5. This will enable more robust and versatile processing and understanding of multi-modal data.
- Multilingual Support: Our next step is to support multiple languages, moving towards true multi-modal and multilingual language models. We believe this will significantly broaden Macaw-LLM's applicability and enhance its understanding of diverse, global contexts.
We would like to express our gratitude to the following open-source projects for their valuable contributions to Macaw-LLM:
- Stanford Alpaca for providing the Alpaca dataset, which we used in our experiments.
- Parrot for providing a helpful implementation of the training of LLaMA.
- CLIP for providing a strong image and video encoding model.
- Whisper for providing a strong audio encoding model.
- LLaMA for providing a powerful LLM.
We would also like to thank the developers and maintainers of these projects for their dedication and hard work in making their projects open-source and accessible to the community.
@article{lyu2023macaw,
title={Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration},
author={Lyu, Chenyang and Wu, Minghao and Wang, Longyue and Huang, Xinting and Liu, Bingshuai and Du, Zefeng and Shi, Shuming and Tu, Zhaopeng},
journal={arXiv preprint arXiv:2306.09093},
year={2023}
}