This repository contains the code used for fine-tuning the SpecVQGAN model for the coursework conducted at the National University of Kyiv-Mohyla Academy, titled Exploration of Multimodal Approaches in Image-to-Audio Synthesis. Our primary goal was to examine how several well-known deep learning observations transfer to the image-to-audio domain through a comparative analysis of parameter configurations. Full details on our coursework can be found here.
This repository supports both Linux and Windows setups using conda virtual environments. We used PyTorch 2.2 and CUDA 12.1 for GPU-accelerated training. The setup files include configurations for Linux and Windows, as well as an optional Docker environment.
To set up the environment:
- Choose the appropriate conda configuration file:
  - For Linux: `conda_env.yaml`
  - For Windows: `conda_env_win.yaml`

  Note: the `conda_env` files use PyTorch 2.2 with CUDA 12.1. If you're using a different version, or if you will be using ROCm or CPU for training, you might need to install PyTorch manually or modify the environment file according to the official PyTorch installation guide (see the sanity check after this list).
- Install the environment with the following command:

  ```bash
  conda env create -f conda_env.yaml      # For Linux
  conda env create -f conda_env_win.yaml  # For Windows
  ```
- (Optional) A Dockerfile is provided for creating a Docker environment. The configuration may require updates, since it wasn't used in our experiments and was simply copied from the original repository.
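After creating and activating the environment, a quick sanity check such as the one below (a minimal sketch added here for convenience; it is not part of the original repository) can confirm that the installed PyTorch build matches the expected CUDA 12.1 toolchain and can see a GPU:

```python
# Minimal environment sanity check (illustrative; not part of the original repo).
import torch

print("PyTorch version:", torch.__version__)   # expected: 2.2.x
print("CUDA build:", torch.version.cuda)       # expected: 12.1 (None for CPU/ROCm builds)
print("GPU available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```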
We used the Visually Aligned Sounds (VAS) dataset for all of our experiments due to its small size, which allowed for quicker training and testing. Training was initially done on a personal RTX 3060 (6 GB VRAM) GPU but later scaled to a desktop with an RTX 3080 (12 GB VRAM) and finally a SLURM cluster using A100 and V100 GPUs with 40 and 80 GB VRAM.
We used the same metrics as the authors of SpecVQGAN to evaluate fidelity and relevance: Fréchet Inception Distance (FID) and Melception-based KL-divergence (MKL); lower is better for both.
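For reference, FID compares the mean and covariance of embedding features extracted from real and generated samples (Melception features in the SpecVQGAN setup). The sketch below illustrates the general formula only and is not the evaluation code used in our experiments; the names `frechet_distance`, `feats_real`, and `feats_gen` are illustrative:

```python
# Sketch of the FID formula (illustrative, not the SpecVQGAN evaluation code):
# FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2 * (Sigma_r @ Sigma_g)^(1/2)),
# where (mu, Sigma) are the mean and covariance of features from real and
# generated samples.
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Compute FID between two feature matrices of shape (N, D)."""
    mu_r, sigma_r = feats_real.mean(axis=0), np.cov(feats_real, rowvar=False)
    mu_g, sigma_g = feats_gen.mean(axis=0), np.cov(feats_gen, rowvar=False)

    diff = mu_r - mu_g
    # Matrix square root of the covariance product; drop tiny imaginary parts
    # introduced by numerical error.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```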
The results and detailed explanations of our findings are available in the main GitHub README for the coursework. You can access it here.
This repository was forked from the official SpecVQGAN repository. The original paper, full documentation, and usage examples can be found here.
This project is licensed under the MIT License.