Guides
- Requirements and Installation
- Model Checkpoints
- Feature Extraction
- Data Preparation
- Pre-Training
- Fine-Tuning
- Inference and Evaluation
- We release EAT-large (20 epochs) with SOTA performance on AS-2M, AS-20K, ESC-50 and SPC-2.
- We have updated the checkpoints and code; EAT now seamlessly supports variable-length audio throughout the training, feature extraction, inference, and evaluation phases.
EAT is an audio self-supervised learning (SSL) model that achieves both high effectiveness and high efficiency during pre-training. You can find details in the paper EAT: Self-Supervised Pre-Training with Efficient Audio Transformer.
The minimum environment requirements are Python >= 3.8 and PyTorch >= 1.13. You can find the versions of the other dependencies we use in `requirements.txt`.
```bash
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
git clone https://github.com/cwx-worst-one/EAT
```
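As a quick sanity check before running the scripts, you can verify that the interpreter and PyTorch meet the minimum versions above. This is a minimal sketch and not part of the EAT codebase:

```python
# Minimal environment check against the stated minimums
# (Python >= 3.8, PyTorch >= 1.13); not part of the EAT codebase.
import sys
import torch

assert sys.version_info >= (3, 8), "EAT requires Python >= 3.8"
print(f"Python {sys.version.split()[0]}, PyTorch {torch.__version__}")
```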
You can download the EAT-base (10 epochs) checkpoints from Google Drive.
- AS-2M Pre-trained
- AS-2M Pre-trained+Fine-tuned (AS-2M)
- AS-2M Pre-trained+Fine-tuned (AS-20K)
Update! 🆕 (Recommended)
We have introduced two new variants of the EAT pre-trained model, along with their fine-tuned versions; each is designed to improve performance through either more pre-training epochs or a larger model size.
Links for model checkpoints:
- EAT-base_epoch30 (pre-training)
- EAT-base_epoch30 (fine-tuning on AS-2M)
- EAT-large_epoch20 (pre-training)
- EAT-large_epoch20 (fine-tuning on AS-2M)
Performance metrics:
| Model | Backbone | Parameters | Pre-training Epochs | AS-20K mAP (%) | AS-2M mAP (%) |
|---|---|---|---|---|---|
| EAT-base | ViT-B | 88M | 10 | 40.3 | 48.6 |
| EAT-base | ViT-B | 88M | 30 | 41.3 | 48.9 |
| EAT-large | ViT-L | 309M | 20 | 42.0 | 49.5 |
We provide a script for extracting audio features from the last layer of the EAT encoder. The features are stored in `.npy` format, and the frame rate of the extracted features is ~50 Hz. EAT provides both frame-level features and an utterance-level feature (represented by the CLS token).
To extract latent representations from audio clips, you can use our pre-trained checkpoint, a fine-tuned checkpoint, or your own, and then run the script `feature_extract.sh`:
```bash
bash EAT/scripts/feature_extract.sh
```
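Once extraction finishes, the saved features can be inspected with NumPy. The sketch below is only an illustration; `target.npy` is a placeholder for whatever output path you configure in `feature_extract.sh`:

```python
# Hedged sketch: inspect features produced by feature_extract.sh.
# "target.npy" is a placeholder for your configured output path.
import numpy as np

feats = np.load("target.npy")
print(feats.shape)
# Frame-level features arrive at roughly 50 frames per second, so a
# 10-second clip yields on the order of 500 frames; the utterance-level
# (CLS) representation is a single vector per clip.
```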
The main dataset in our experiments is AudioSet. Regrettably, we are unable to release the audio data due to copyright restrictions. The data manifest is available here. We follow the file format used in wav2vec and data2vec, where the `.tsv` file serves as the index and the `.lbl` and `.csv` files are specific to the classification task. You can adapt these files to your own dataset.
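For reference, a wav2vec-style `.tsv` index starts with the root directory on its first line, followed by one relative audio path and its sample count per line. The sketch below builds such an index under that assumption; verify the exact layout against the released manifest before using it:

```python
# Hedged sketch of a wav2vec-style .tsv index: first line is the root
# directory, each following line is "<relative path>\t<num samples>".
# Check the exact layout against the released EAT data manifest.
import os
import soundfile as sf

def write_manifest(audio_dir: str, out_tsv: str) -> None:
    with open(out_tsv, "w") as f:
        f.write(audio_dir + "\n")
        for name in sorted(os.listdir(audio_dir)):
            if name.endswith(".wav"):
                frames = sf.info(os.path.join(audio_dir, name)).frames
                f.write(f"{name}\t{frames}\n")
```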
Our code is adapted from Audio-MAE and data2vec. We use `pretraining_AS2M.yaml` as our default pre-training config. To pre-train the EAT model on AudioSet, run the script `pretraining_AS2M.sh`:
```bash
bash EAT/scripts/pretraining_AS2M.sh
```
If you need to pre-train the EAT model on other datasets where audio lengths are not fixed at 10 seconds, refer to the instructions in `feature_extract/readme.md`.
We use `finetuning.yaml` as our default fine-tuning config. To fine-tune the EAT model on different downstream tasks, run the script `finetuning_{task}.sh`, where `{task}` is one of `AS20K`, `AS2M`, `ESC50`, and `SPCv2`. For example, you can fine-tune EAT on AS20K by executing:
```bash
bash EAT/scripts/finetuning_AS20K.sh
```
For inference on a single AudioSet audio clip with a fine-tuned model, you can use our EAT checkpoints fine-tuned on AS-2M (recommended) or AS-20K and run the script `inference.sh`:
```bash
bash EAT/scripts/inference.sh
```
An example output is as follows:
```
# top_k_prediction = 12
************ Acoustic Event Inference ************
LABEL                PREDICTION
Percussion           0.523
Drum kit             0.437
Vibraphone           0.420
Drum                 0.316
Music                0.303
Snare drum           0.277
Glockenspiel         0.225
Marimba, xylophone   0.223
Cymbal               0.213
Bass drum            0.207
Hi-hat               0.196
Mallet percussion    0.170
**************************************************
```
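For reference, a table like the one above can be obtained from the model's clip-level logits with a sigmoid followed by top-k selection. The sketch below is illustrative only and does not reproduce `inference.sh`; the `logits` tensor and `label_map` list are assumed inputs:

```python
# Illustrative sketch (not the repo's inference code): turn clip-level
# logits into a top-k list of (label, probability) pairs.
# `logits` has shape [num_classes]; `label_map` maps indices to names.
import torch

def top_k_events(logits: torch.Tensor, label_map: list, k: int = 12):
    probs = torch.sigmoid(logits)     # AudioSet tagging is multi-label
    scores, indices = probs.topk(k)   # k highest-scoring classes
    return [(label_map[i], round(s, 3))
            for s, i in zip(scores.tolist(), indices.tolist())]
```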
For a comprehensive evaluation on the full AudioSet eval set with fine-tuned EAT models, run the evaluation script `eval.sh`:
```bash
bash EAT/scripts/eval.sh
```
This script reports the mAP on the AudioSet evaluation set. Per-class AP values can be found at `./EAT/ap_log.txt`. You can also refer to the results of our fine-tuned EAT models on the AudioSet evaluation set under `./EAT/results`.
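For reference, the reported mAP is the unweighted mean of per-class average precision over all AudioSet classes. A minimal sketch of the metric (not the repo's evaluation code) is shown below:

```python
# Minimal sketch of the reported metric: mean average precision (mAP)
# as the unweighted mean of per-class AP. Not the repo's evaluation code.
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(targets: np.ndarray, scores: np.ndarray) -> float:
    """targets: (num_clips, num_classes) binary labels;
    scores: (num_clips, num_classes) predicted probabilities."""
    ap = [average_precision_score(targets[:, c], scores[:, c])
          for c in range(targets.shape[1])]
    return float(np.mean(ap))
```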
Pre-trained on AS-2M, EAT achieves state-of-the-art (SOTA) performance on several audio and speech classification datasets, including AS-20K, AS-2M, ESC-50, and SPC-2.
EAT reduces total pre-training time by ~15x compared with BEATs and ~10x compared with Audio-MAE, requiring only 10 epochs of pre-training on AS-2M.
We track our experiment logs with wandb and have published a short WandB report detailing the training process and performance metrics of the EAT model. You can view it here.
- release the final EAT-large
- update code and checkpoints for friendlier usage
- release the Docker image
If you find our EAT codes and models useful, please cite the following paper:
```bibtex
@article{chen2024eat,
  title={EAT: Self-Supervised Pre-Training with Efficient Audio Transformer},
  author={Chen, Wenxi and Liang, Yuzhe and Ma, Ziyang and Zheng, Zhisheng and Chen, Xie},
  journal={arXiv preprint arXiv:2401.03497},
  year={2024}
}
```
Our codebase is based on the awesome Audio-MAE and data2vec repositories.