Experiments on the CLAP model for the DL course at IMTA (2023-2024)
To download all the datasets, run the `dataset.sh` script (cf. this page).
We use the following datasets:
- ESC-50: 50 classes of environmental sounds, 2000 samples, 5 seconds each.
- UrbanSound8K: 10 classes of urban sounds, 8732 samples, 4 seconds each.
- FMA-Small: 8 genres of music, 8000 samples, 30 seconds each.
- AudioSet: 527 classes of sounds, 2,084,320 samples, 10 seconds each. However, we only use a subset of a few classes (see figure below).
Image of the last audio processed by the model (from the ESC-50 dataset).
Running the `main.py` script over the whole ESC-50 dataset on a GTX 1060 consumes 1321 MiB / 6144 MiB of GPU RAM and takes less than 20 minutes to complete.
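For reference, the zero-shot evaluation boils down to comparing CLAP audio and text embeddings. Below is a minimal sketch of that idea, assuming the `laion_clap` package; the file paths and label list are illustrative and this is not the actual `main.py` script:

```python
import numpy as np
import laion_clap

# Illustrative inputs: paths to ESC-50 clips and the class names (normally all 50).
audio_files = ["audio/1-100032-A-0.wav", "audio/1-100038-A-14.wav"]
class_names = ["dog", "chirping birds"]

model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()  # downloads the default pretrained checkpoint

# Embed the label texts and the audio clips into the shared CLAP space.
text_emb = model.get_text_embedding(class_names)                     # (n_classes, d)
audio_emb = model.get_audio_embedding_from_filelist(x=audio_files)   # (n_clips, d)

# Cosine similarity between each clip and each label; the argmax is the prediction.
text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
audio_emb = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
similarity = audio_emb @ text_emb.T
predictions = similarity.argmax(axis=1)
print([class_names[i] for i in predictions])
```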
We also tried to augment the labels of the ESC-50 dataset by turning single words into full sentences. For example, the label `dog` becomes `A dog is barking`. The idea is to give the model more context and help it learn more about the meaning of the sounds. We gained more than 10% in accuracy, and the confusion matrix looks better.
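The augmentation is essentially prompt templating. A small sketch of the idea follows, with a hand-written mapping and a generic fallback template; the exact sentences used in our experiments may differ:

```python
# Illustrative mapping from raw ESC-50 labels to full sentences.
LABEL_TO_SENTENCE = {
    "dog": "A dog is barking",
    "rain": "Rain is falling",
    "car_horn": "A car horn is honking",
}

def augment_label(label: str) -> str:
    # Fall back to a generic template for labels without a hand-written sentence.
    return LABEL_TO_SENTENCE.get(label, f"This is a sound of {label.replace('_', ' ')}")

print(augment_label("dog"))        # "A dog is barking"
print(augment_label("sea_waves"))  # "This is a sound of sea waves"
```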
On 2000 samples of the UrbanSound8K dataset, the model takes about 35 minutes to run on a GTX 1060.
Confusion matrix of the model over the UrbanSound8K dataset (2000 samples, augmented labels, top-1 accuracy)
Confusion matrix of the model over the UrbanSound8K dataset (2000 samples, augmented labels, top-3 accuracy)
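Both accuracies and confusion matrices can be derived from the audio-text similarity matrix. A minimal sketch with NumPy and scikit-learn follows; the variable names are illustrative, and for the top-3 confusion matrix we assume a prediction counts as correct if the true class is among the 3 best scores, falling back to the top-1 prediction otherwise:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def evaluate(similarity: np.ndarray, true_labels: np.ndarray, k: int = 3):
    """similarity: (n_clips, n_classes) audio-text scores; true_labels: (n_clips,) class indices."""
    top1 = similarity.argmax(axis=1)
    topk = np.argsort(-similarity, axis=1)[:, :k]
    top1_acc = (top1 == true_labels).mean()
    topk_acc = (topk == true_labels[:, None]).any(axis=1).mean()
    # Top-k confusion matrix: credit the true class when it appears in the top k,
    # otherwise keep the top-1 prediction.
    topk_pred = np.where((topk == true_labels[:, None]).any(axis=1), true_labels, top1)
    cm_top1 = confusion_matrix(true_labels, top1)
    cm_topk = confusion_matrix(true_labels, topk_pred)
    return top1_acc, topk_acc, cm_top1, cm_topk
```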
The accuracy on the FMA-Small dataset is very low; we think this might be related to poor labels. We tried to augment the labels, but it did not improve the accuracy by much.
There are some clusters in the embeddings, but they do not match the genre labels very well. The model is, however, well suited to sound retrieval.
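As an illustration of that retrieval use case, ranking clips by cosine similarity to a text query embedding is enough. The sketch below assumes the audio embeddings for the FMA-Small clips have already been computed (all names are illustrative):

```python
import numpy as np

def retrieve(query_emb: np.ndarray, audio_emb: np.ndarray, file_names: list[str], k: int = 5) -> list[str]:
    """Return the k audio files whose embeddings are closest (cosine) to the text query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    scores = a @ q                      # cosine similarity of each clip to the query
    best = np.argsort(-scores)[:k]      # indices of the k most similar clips
    return [file_names[i] for i in best]

# Example: retrieve("a fast electronic track with heavy bass" embedded with CLAP, ...)
```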