In this paper we propose a novel modification of CLIP guidance for the task of unsupervised backlit image enhancement. Our work builds on the state-of-the-art CLIP-LIT approach, which learns a prompt pair by constraining the text-image similarity between a prompt (negative/positive sample) and a corresponding image (backlit image/well-lit image) in the CLIP embedding space. Learned prompts then guide an image enhancement network. Based on the CLIP-LIT framework, we propose two novel methods for CLIP guidance. First, we show that instead of tuning prompts in the space of text embeddings, it is possible to directly tune their embeddings in the latent space without any loss in quality. This accelerates training and potentially enables the use of additional encoders that do not have a text encoder. Second, we propose a novel approach that does not require any prompt tuning. Instead, based on CLIP embeddings of backlit and well-lit images from training data, we compute the residual vector in the embedding space as a simple difference between the mean embeddings of the well-lit and backlit images. This vector then guides the enhancement network during training, pushing a backlit image towards the space of well-lit images. This approach further dramatically reduces training time, stabilizes training and produces high quality enhanced images without artifacts, both in supervised and unsupervised training regimes. Additionally, we show that residual vectors can be interpreted, revealing biases in training data, and thereby enabling potential bias correction.
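The residual vector described above (the difference between the mean well-lit embedding and the mean backlit embedding) can be sketched in a few lines. This is a minimal, dependency-free illustration, not the repository's implementation: plain Python lists stand in for CLIP image embeddings, and the helper names `mean_vector` and `residual_vector` are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def residual_vector(welllit_embeddings, backlit_embeddings):
    """Mean well-lit embedding minus mean backlit embedding."""
    mean_welllit = mean_vector(welllit_embeddings)
    mean_backlit = mean_vector(backlit_embeddings)
    return [w - b for w, b in zip(mean_welllit, mean_backlit)]


# Toy 3-dimensional vectors standing in for real CLIP embeddings (assumption).
welllit = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
backlit = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0]]

residual = residual_vector(welllit, backlit)

# Adding the residual to the mean backlit embedding recovers the mean
# well-lit embedding, which is the direction the guidance relies on.
shifted = [b + r for b, r in zip(mean_vector(backlit), residual)]
print(residual, shifted)
```

In practice the embeddings would come from the CLIP image encoder applied to the training images; how the residual enters the training loss is defined in the paper and the training code, not in this sketch.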
RAVE exploits arithmetic defined in the CLIP latent space. Using well-lit and backlit training data, we construct a residual vector that points from backlit images towards well-lit images in the CLIP embedding space. This vector then guides the image enhancement model during training, so that the model learns to produce images whose CLIP latent vectors are close to those of well-lit training images.
- 2024.08.26: Code for training and testing, as well as model checkpoints, is now publicly available.
Training and testing data can be downloaded from:
- BAID dataset (train and test parts);
- DIV2K images (well-lit images used instead of well-lit images from BAID for training models in unpaired setting);
- LOL-v1 dataset for low-light image enhancement task (see supplementary material of RAVE paper for results on this data).
Train CLIP-LIT:
python train.py --cfg ./configs/train/clip_lit.yaml
Train CLIP-LIT-Latent:
python train.py --cfg ./configs/train/clip_lit_latent.yaml
Before running, make sure the paths to training data in the config are correct (backlit_images_path and welllit_images_path in the config).
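A quick pre-flight check can catch a wrong data path before a training run starts. The sketch below is only an illustration: it assumes the config has already been loaded into a flat dictionary (the real YAML file may nest these keys differently), the directory values are placeholders, and `missing_paths` is a hypothetical helper that is not part of this repository.

```python
import os


def missing_paths(config, keys):
    """Return the keys whose value is absent or is not an existing directory."""
    return [key for key in keys if not os.path.isdir(str(config.get(key, "")))]


# Hypothetical flat config with placeholder directories (assumption).
config = {
    "backlit_images_path": "./data/backlit",
    "welllit_images_path": "./data/welllit",
}

problems = missing_paths(config, ["backlit_images_path", "welllit_images_path"])
if problems:
    print("Check these config entries:", ", ".join(problems))
else:
    print("All data paths exist.")
```

The same helper can be reused for the inference and metrics configs by passing the relevant key names.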
If you have pre-trained Unet and/or guidance model checkpoints, you can resume training by changing the load_pretrain arguments corresponding to the Unet/guidance model in the config. For more information on config arguments, see Readme.md in the config directory.
Train RAVE:
python train_rave.py --cfg ./configs/train/rave.yaml
Before running, make sure the paths to training data in the config are correct (backlit_images_path and welllit_images_path in the config).
To train RAVE with the residual shifted by n tokens, change the remove_first_n_tokens argument in the config.
Pretrained checkpoints for all models are stored in the pretrained_models directory.
Models trained on paired data:
- CLIP-LIT: clip_lit_paired.pth;
- CLIP-LIT-Latent: clip_lit_latent_paired.pth;
- RAVE: rave_paired.pth.
Models trained on unpaired data:
- CLIP-LIT: clip_lit_unpaired.pth;
- CLIP-LIT-Latent: clip_lit_latent_unpaired.pth;
- RAVE without shifting the residual: rave_unpaired.pth;
- RAVE with shifting the residual by 15 tokens: rave_unpaired_shifted.pth.
To run a trained model on backlit images, use the following command:
python inference.py --cfg ./configs/inference/inference.yaml
Before running, make sure that the path to testing data in the config is correct (input in the config).
To compute metrics (SSIM, PSNR, LPIPS) for a set of enhanced images against the corresponding ground-truth well-lit images, use the following command:
python compute_metrics.py --cfg ./configs/inference/metrics.yaml
Before running, make sure that the paths to ground-truth well-lit data and enhanced images in the config are correct (gt_images_path and enhanced_images_path in the config).
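For readers unfamiliar with the metrics, PSNR is the simplest of the three to reproduce by hand. The sketch below uses the standard definition, 10 · log10(MAX² / MSE), on flattened pixel lists; it is an independent illustration, not the code in compute_metrics.py, and the function name `psnr` and the default `max_value` of 255 (8-bit images) are assumptions.

```python
import math


def psnr(reference, enhanced, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if not reference or len(reference) != len(enhanced):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((r - e) ** 2 for r, e in zip(reference, enhanced)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: a ground-truth image and an enhanced image, flattened to lists.
ground_truth = [52, 60, 200, 255]
enhanced = [50, 61, 198, 250]
print(f"PSNR: {psnr(ground_truth, enhanced):.2f} dB")
```

Higher PSNR means the enhanced image is closer to the ground truth. SSIM and LPIPS capture structural and perceptual similarity respectively and need dedicated libraries.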
If you find our work useful, please consider citing the paper:
@misc{gaintseva2024raveresidualvectorembedding,
title={RAVE: Residual Vector Embedding for CLIP-Guided Backlit Image Enhancement},
author={Tatiana Gaintseva and Martin Benning and Gregory Slabaugh},
year={2024},
eprint={2404.01889},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2404.01889},
}
Please feel free to reach out at [email protected].