Experiments on the CLAP model for the DL course at IMTA (2023-2024)
To download all the datasets, run the `dataset.sh` script (cf. this page).
We use the following datasets:
- ESC-50: 50 classes of environmental sounds, 2000 samples, 5 seconds each.
- UrbanSound8K: 10 classes of urban sounds, 8732 samples, 4 seconds each.
- FMA-Small: 8 genres of music, 8000 samples, 30 seconds each.
- AudioSet: 527 classes of sounds, 2,084,320 samples, 10 seconds each. However, we only use a subset of a few classes (see figure below).
Image of the last audio processed by the model (from the ESC-50 dataset).
Running the `main.py` script over the whole ESC-50 dataset on a GTX 1060 consumes 1321 MiB / 6144 MiB of GPU RAM and takes less than 20 minutes to complete.
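For reference, the zero-shot evaluation boils down to comparing CLAP audio and text embeddings. Below is a minimal sketch of that idea, assuming the `laion_clap` package; the file paths and label list are illustrative and this is not the actual `main.py` script:

```python
import numpy as np
import laion_clap

# Illustrative inputs: paths to ESC-50 clips and the class names (normally all 50).
audio_files = ["audio/1-100032-A-0.wav", "audio/1-100038-A-14.wav"]
class_names = ["dog", "chirping birds"]

model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()  # downloads the default pretrained checkpoint

# Embed the label texts and the audio clips into the shared CLAP space.
text_emb = model.get_text_embedding(class_names)                     # (n_classes, d)
audio_emb = model.get_audio_embedding_from_filelist(x=audio_files)   # (n_clips, d)

# Cosine similarity between each clip and each label; the argmax is the prediction.
text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
audio_emb = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
similarity = audio_emb @ text_emb.T
predictions = similarity.argmax(axis=1)
print([class_names[i] for i in predictions])
```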
We also tried to augment the labels of the ESC-50 dataset by turning single words into full sentences. For example, the label `dog` becomes `A dog is barking`. The idea is to give the model more context and help it learn more about the meaning of the sounds. We gained more than 10% in accuracy, and the confusion matrix looks better.
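The augmentation is essentially prompt templating. A small sketch of the idea follows, with a hand-written mapping and a generic fallback template; the exact sentences used in our experiments may differ:

```python
# Illustrative mapping from raw ESC-50 labels to full sentences.
LABEL_TO_SENTENCE = {
    "dog": "A dog is barking",
    "rain": "Rain is falling",
    "car_horn": "A car horn is honking",
}

def augment_label(label: str) -> str:
    # Fall back to a generic template for labels without a hand-written sentence.
    return LABEL_TO_SENTENCE.get(label, f"This is a sound of {label.replace('_', ' ')}")

print(augment_label("dog"))        # "A dog is barking"
print(augment_label("sea_waves"))  # "This is a sound of sea waves"
```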
On 2000 samples of the UrbanSound8K dataset, the model takes about 35 minutes to run on a GTX 1060.
Confusion matrix of the model over the UrbanSound8K dataset (2000 samples, augmented labels, top-1 accuracy)
Confusion matrix of the model over the UrbanSound8K dataset (2000 samples, augmented labels, top-3 accuracy)
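Both accuracies and confusion matrices can be derived from the audio-text similarity matrix. A minimal sketch with NumPy and scikit-learn follows; the variable names are illustrative, and for the top-3 confusion matrix we assume a prediction counts as correct if the true class is among the 3 best scores, falling back to the top-1 prediction otherwise:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def evaluate(similarity: np.ndarray, true_labels: np.ndarray, k: int = 3):
    """similarity: (n_clips, n_classes) audio-text scores; true_labels: (n_clips,) class indices."""
    top1 = similarity.argmax(axis=1)
    topk = np.argsort(-similarity, axis=1)[:, :k]
    top1_acc = (top1 == true_labels).mean()
    topk_acc = (topk == true_labels[:, None]).any(axis=1).mean()
    # Top-k confusion matrix: credit the true class when it appears in the top k,
    # otherwise keep the top-1 prediction.
    topk_pred = np.where((topk == true_labels[:, None]).any(axis=1), true_labels, top1)
    cm_top1 = confusion_matrix(true_labels, top1)
    cm_topk = confusion_matrix(true_labels, topk_pred)
    return top1_acc, topk_acc, cm_top1, cm_topk
```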
The accuracy on the FMA-Small dataset is very low; we think this might be related to poor labels. We tried to augment the labels, but it did not improve the accuracy by much.
There are some clusters in the embeddings, but they do not match the genre labels very well. The model is, however, well suited to sound retrieval.
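As an illustration of that retrieval use case, ranking clips by cosine similarity to a text query embedding is enough. The sketch below assumes the audio embeddings for the FMA-Small clips have already been computed (all names are illustrative):

```python
import numpy as np

def retrieve(query_emb: np.ndarray, audio_emb: np.ndarray, file_names: list[str], k: int = 5) -> list[str]:
    """Return the k audio files whose embeddings are closest (cosine) to the text query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    scores = a @ q                      # cosine similarity of each clip to the query
    best = np.argsort(-scores)[:k]      # indices of the k most similar clips
    return [file_names[i] for i in best]

# Example: retrieve("a fast electronic track with heavy bass" embedded with CLAP, ...)
```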