MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild
This repository provides an official implementation for the paper MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild.
Please create an environment with Python 3.10 and use the requirements file to install the remaining libraries:
pip install -r requirements.txt
We provide code for the DFEW and MAFW datasets, which you need to download yourself. The annotations are provided in the annotations/ directory. You need to update the paths in them to your own; the same directory contains preprocessing scripts that preprocess the data and rewrite the paths.
For the MAFW dataset, you need to extract faces from the videos. Please refer to data_utils, which contains an example face detection pipeline. To extract audio from the video files (in both datasets), use the following script (after modifying the paths to your own).
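As an illustration of the audio extraction step, here is a minimal sketch that calls ffmpeg from Python. It is not the repository's own script: the function names, the `*.mp4` pattern, the mono downmix, and the 16 kHz sample rate are all assumptions, so check them against what the data loader expects before using the output.

```python
from pathlib import Path
import subprocess


def build_ffmpeg_cmd(video_path, audio_path, sample_rate=16000):
    """Return an ffmpeg command that writes a video's audio track to a file."""
    return [
        "ffmpeg", "-y",            # overwrite the output file if it exists
        "-i", str(video_path),     # input video
        "-vn",                     # drop the video stream
        "-ac", "1",                # downmix to mono (assumption)
        "-ar", str(sample_rate),   # resample (assumption: 16 kHz)
        str(audio_path),
    ]


def extract_all(video_dir, audio_dir, pattern="*.mp4"):
    """Extract audio for every matching video; requires ffmpeg on PATH."""
    audio_dir = Path(audio_dir)
    audio_dir.mkdir(parents=True, exist_ok=True)
    for video in sorted(Path(video_dir).glob(pattern)):
        target = audio_dir / (video.stem + ".wav")
        subprocess.run(build_ffmpeg_cmd(video, target), check=True)
```

Calling `extract_all("/path/to/videos", "/path/to/audio")` with your own paths would write one WAV file per video, named after the video file.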
You will also need to download the pre-trained checkpoints for the vision encoder from https://github.com/FuxiVirtualHuman/MAE-Face and for the audio encoder from https://github.com/facebookresearch/AudioMAE. Please extract them and rename the audio checkpoint to 'audiomae_pretrained.pth'. Both checkpoints are expected to be in the root folder.
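Before training, it can help to confirm that the checkpoints are where the code expects them. The helper below is a hypothetical convenience, not part of the repository. Only the audio checkpoint name comes from this README; the vision checkpoint file name is not stated here, so pass whatever name the MAE-Face download uses.

```python
from pathlib import Path


def missing_checkpoints(root, names):
    """Return the checkpoint file names that are not present in root."""
    root = Path(root)
    return [name for name in names if not (root / name).is_file()]


# The audio checkpoint name is taken from this README; add the vision
# checkpoint's file name to this list once you know it.
EXPECTED = ["audiomae_pretrained.pth"]
```

Running `missing_checkpoints(".", EXPECTED)` from the repository root returns an empty list when every expected file is present.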
The main script is main.py. You can invoke it by running:
./train_DFEW.sh
./train_MAFW.sh
You can download pre-trained models on DFEW from here and on MAFW from here. Please respect the dataset license when downloading the models! Evaluation can be done as follows:
python evaluate.py --fold $FOLD --checkpoint $CHECKPOINT_PATH --img-size $IMG_SIZE --dataset [MAFW|DFEW]
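To evaluate several folds in one go, the command above can be assembled programmatically. This is only a sketch: the fold numbering (1 to 5), the checkpoint naming pattern, and the helper names are assumptions, and the image size is left for you to supply because this README does not state a value.

```python
import subprocess


def build_eval_cmd(fold, checkpoint, img_size, dataset):
    """Assemble the evaluate.py invocation shown above as an argument list."""
    if dataset not in ("MAFW", "DFEW"):
        raise ValueError(f"unsupported dataset: {dataset}")
    return [
        "python", "evaluate.py",
        "--fold", str(fold),
        "--checkpoint", str(checkpoint),
        "--img-size", str(img_size),
        "--dataset", dataset,
    ]


def evaluate_folds(dataset, img_size, checkpoint_pattern, folds=range(1, 6)):
    """Run evaluation per fold; folds 1-5 and the pattern are assumptions."""
    for fold in folds:
        checkpoint = checkpoint_pattern.format(dataset=dataset, fold=fold)
        subprocess.run(build_eval_cmd(fold, checkpoint, img_size, dataset), check=True)
```

A hypothetical `checkpoint_pattern` such as `"checkpoints/{dataset}_fold{fold}.pth"` would be expanded once per fold; adjust it to match how you stored the downloaded models.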
This repository is based on DFER-CLIP https://github.com/zengqunzhao/DFER-CLIP. We also thank the authors of MAE-Face https://github.com/FuxiVirtualHuman/MAE-Face and AudioMAE https://github.com/facebookresearch/AudioMAE.
If you use our work, please cite as:
@InProceedings{Chumachenko_2024_CVPR,
author = {Chumachenko, Kateryna and Iosifidis, Alexandros and Gabbouj, Moncef},
title = {MMA-DFER: MultiModal Adaptation of Unimodal Models for Dynamic Facial Expression Recognition In-the-wild},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2024},
pages = {4673-4682}
}