The official repository containing the code and model checkpoints for our paper Autoregressive Pre-Training on Pixels and Texts (EMNLP 2024).
- 21 September, 2024: 🎉 Our work has been accepted to EMNLP 2024! 🎉
- 1 May, 2024: 🎉 We release the official codebase and model weights of PixelGPT, MonoGPT, and DualGPT. Stay tuned! 🔥
Harnessing visual texts represents a burgeoning frontier in the evolution of language modeling. In this paper, we introduce a novel pre-training framework for a suite of pixel-based autoregressive language models, pre-trained on a corpus of over 400 million documents rendered as RGB images. Our approach is characterized by a dual-modality training regimen, engaging both visual data through next patch prediction with a regression head and textual data via next token prediction with a classification head. This study is particularly focused on investigating the synergistic interplay between the visual and textual modalities of language. Our comprehensive evaluation across a diverse array of benchmarks reveals that the confluence of visual and textual data substantially augments the efficacy of pixel-based language models. Notably, our findings show that a unidirectional pixel-based model, devoid of textual data during training, can match the performance levels of advanced bidirectional pixel-based models on various language understanding benchmarks. This work highlights the considerable untapped potential of integrating visual and textual information for language modeling purposes. We release our code, data, and checkpoints to inspire further research advancement.
To set up the environment and install dependencies, run:
bash run_requirements.sh
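As an optional sanity check, you can confirm that the core libraries are importable afterwards. This assumes run_requirements.sh installs PyTorch and the HuggingFace libraries that the fine-tuning scripts rely on (an assumption, not something this README specifies):
# Optional sanity check; assumes the setup script installed torch,
# transformers, and datasets.
python -c "import torch, transformers, datasets; print(torch.__version__)"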
We fine-tune PixelGPT on the rendered GLUE and XNLI datasets. These rendered versions are publicly available at baidu/rendered_GLUE and baidu/rendered_xnli. After downloading the datasets from HuggingFace, extract them locally:
# Extract rendered GLUE
tar -xvf rendered_glue.tar
# Extract rendered XNLI
tar -xvf rendered_xnli.tar
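If you have not downloaded the archives yet, one way to fetch them is with the huggingface_hub CLI (a sketch; the exact file layout inside each dataset repository may differ):
# Requires: pip install -U huggingface_hub
huggingface-cli download baidu/rendered_GLUE --repo-type dataset --local-dir rendered_glue
huggingface-cli download baidu/rendered_xnli --repo-type dataset --local-dir rendered_xnli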
For the rendered GLUE dataset, the extracted files contain multiple tasks. Each task has a corresponding training set, validation set, and test set. Note that for the MNLI task, both the validation and test sets contain matched and mismatched versions. You will need to assign the local paths of these task datasets to the --train_file, --validation_file, and --test_file parameters in the fine-tuning script.
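As an illustrative sketch only: the actual entry point and remaining flags are defined inside each fine-tuning script, and both the finetune_glue.py name and the data paths below are hypothetical placeholders. For MNLI, the three flags would be wired up along these lines:
# Hypothetical invocation; in practice you set these flags inside the
# task's fine-tuning script, pointing at your local extracted paths.
python finetune_glue.py \
  --train_file /path/to/rendered_glue/mnli/train \
  --validation_file /path/to/rendered_glue/mnli/validation_matched \
  --test_file /path/to/rendered_glue/mnli/test_matched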
For the rendered XNLI dataset, assign the local dataset path to the --data_file_dir parameter in the corresponding fine-tuning script.
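Again as a placeholder sketch (finetune_xnli.py stands in for whatever entry point the script actually invokes):
# Hypothetical invocation; only --data_file_dir matters here.
python finetune_xnli.py --data_file_dir /path/to/rendered_xnli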
We pre-trained three models: PixelGPT, MonoGPT, and DualGPT. We release the checkpoints used in our experiments, which can be downloaded at baidu/PixelGPT, baidu/MonoGPT, and baidu/DualGPT. Before running the fine-tuning scripts below, download the corresponding pre-trained model from our open-source model repositories above and place the files in the pre-trained model directory, e.g. pretrained_models/PixelGPT.
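For example, to fetch the PixelGPT checkpoint with the huggingface_hub CLI (one option; cloning the model repository works as well):
# Download a pre-trained checkpoint into the expected local directory
huggingface-cli download baidu/PixelGPT --local-dir pretrained_models/PixelGPT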
Our main fine-tuning experiments were performed on rendered GLUE and XNLI. The scripts to run the experiments are given below.
For example, to fine-tune on the MNLI task:
bash run/pixel_gpt/ft_pixel_gpt_mnli.sh pretrained_models/PixelGPT
# Text-only Fine-tuning
bash run/mono_gpt/ft_mono_gpt_mnli_text.sh pretrained_models/MonoGPT
# Pixel-only Fine-tuning
bash run/mono_gpt/ft_mono_gpt_mnli_pixel.sh pretrained_models/MonoGPT
# Pair-modality Fine-tuning
bash run/mono_gpt/ft_mono_gpt_mnli_pair.sh pretrained_models/MonoGPT
# Text-only Fine-tuning
bash run/dual_gpt/ft_dual_gpt_mnli_text.sh pretrained_models/DualGPT
# Pixel-only Fine-tuning
bash run/dual_gpt/ft_dual_gpt_mnli_pixel.sh pretrained_models/DualGPT
# Pair-modality Fine-tuning
bash run/dual_gpt/ft_dual_gpt_mnli_pair.sh pretrained_models/DualGPT
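The per-task GLUE scripts follow the same naming pattern as the MNLI one, so a small loop can sweep several tasks. This is a hypothetical convenience snippet: adjust the task list to the scripts actually present under run/pixel_gpt/.
# Hypothetical sweep; assumes ft_pixel_gpt_<task>.sh exists for each entry,
# mirroring the MNLI script shown above.
for task in mnli qqp qnli sst2 cola stsb mrpc rte; do
  bash run/pixel_gpt/ft_pixel_gpt_${task}.sh pretrained_models/PixelGPT
done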
We evaluated XNLI in two settings: (1) Translate-train-all, where the model is fine-tuned on a combination of English data and machine-translated data in 14 other languages; (2) Cross-lingual Transfer, where the model is fine-tuned only on English data and tested on multiple languages.
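# Translate-train-all: fine-tune on English plus machine-translated data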
bash run/cross_lingual/xnli/train_all/pixel_gpt/ft_pixel_gpt_xnli.sh pretrained_models/PixelGPT
# Text-only Fine-tuning
bash run/cross_lingual/xnli/train_all/mono_gpt/ft_mono_gpt_xnli_text.sh pretrained_models/MonoGPT
# Pixel-only Fine-tuning
bash run/cross_lingual/xnli/train_all/mono_gpt/ft_mono_gpt_xnli_image.sh pretrained_models/MonoGPT
# Pair-modality Fine-tuning
bash run/cross_lingual/xnli/train_all/mono_gpt/ft_mono_gpt_xnli_pair.sh pretrained_models/MonoGPT
# Text-only Fine-tuning
bash run/cross_lingual/xnli/train_all/dual_gpt/ft_dual_gpt_xnli_text.sh pretrained_models/DualGPT
# Pixel-only Fine-tuning
bash run/cross_lingual/xnli/train_all/dual_gpt/ft_dual_gpt_xnli_image.sh pretrained_models/DualGPT
# Pair-modality Fine-tuning
bash run/cross_lingual/xnli/train_all/dual_gpt/ft_dual_gpt_xnli_pair.sh pretrained_models/DualGPT
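# Cross-lingual Transfer: fine-tune on English data only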
bash run/cross_lingual/xnli/train_en/pixel_gpt/ft_pixel_gpt_xnli.sh pretrained_models/PixelGPT
# Text-only Fine-tuning
bash run/cross_lingual/xnli/train_en/mono_gpt/ft_mono_gpt_xnli_text.sh pretrained_models/MonoGPT
# Pixel-only Fine-tuning
bash run/cross_lingual/xnli/train_en/mono_gpt/ft_mono_gpt_xnli_image.sh pretrained_models/MonoGPT
# Pair-modality Fine-tuning
bash run/cross_lingual/xnli/train_en/mono_gpt/ft_mono_gpt_xnli_pair.sh pretrained_models/MonoGPT
# Text-only Fine-tuning
bash run/cross_lingual/xnli/train_en/dual_gpt/ft_dual_gpt_xnli_text.sh pretrained_models/DualGPT
# Pixel-only Fine-tuning
bash run/cross_lingual/xnli/train_en/dual_gpt/ft_dual_gpt_xnli_image.sh pretrained_models/DualGPT
# Pair-modality Fine-tuning
bash run/cross_lingual/xnli/train_en/dual_gpt/ft_dual_gpt_xnli_pair.sh pretrained_models/DualGPT
@inproceedings{chai-etal-2024-autoregressive,
title = "Autoregressive Pre-Training on Pixels and Texts",
author = "Chai, Yekun and
Liu, Qingyi and
Xiao, Jingwu and
Wang, Shuohuan and
Sun, Yu and
Wu, Hua",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.182",
pages = "3106--3125",
abstract = "The integration of visual and textual information represents a promising direction in the advancement of language models. In this paper, we explore the dual modality of language{---}both visual and textual{---}within an autoregressive framework, pre-trained on both document images and texts. Our method employs a multimodal training strategy, utilizing visual data through next patch prediction with a regression head and/or textual data through next token prediction with a classification head. We focus on understanding the interaction between these two modalities and their combined impact on model performance. Our extensive evaluation across a wide range of benchmarks shows that incorporating both visual and textual data significantly improves the performance of pixel-based language models. Remarkably, we find that a unidirectional pixel-based model trained solely on visual data can achieve comparable results to state-of-the-art bidirectional models on several language understanding tasks. This work uncovers the untapped potential of integrating visual and textual modalities for more effective language modeling. We release our code, data, and model checkpoints at https://github.com/ernie-research/pixelgpt.",
}