
SpecVQGAN Fine-Tuning Experiments

Overview

This repository contains the code used to fine-tune the SpecVQGAN model for the coursework Exploration of Multimodal Approaches in Image-to-Audio Synthesis, conducted at the National University of Kyiv-Mohyla Academy. Our primary goal was to examine how several well-known deep learning observations transfer to the image-to-audio domain through a comparative analysis of parameter configurations. Full details on the coursework can be found here.

Table of Contents

  1. Installation
  2. Resources and Dataset
  3. Evaluation
  4. Results
  5. Credits
  6. License

Installation

This repository supports both Linux and Windows setups via conda virtual environments; we used PyTorch 2.2 and CUDA 12.1 for GPU-accelerated training. Conda environment files are provided for both platforms, along with an optional Docker configuration.

To set up the environment:

  1. Choose the appropriate conda configuration file:
    • For Linux: conda_env.yaml
    • For Windows: conda_env_win.yaml

Note

The conda_env files pin PyTorch 2.2 with CUDA 12.1. If you need a different CUDA version, or plan to train on ROCm or CPU, install PyTorch manually or adjust the environment file according to the official PyTorch installation guide. A quick sanity check for the resulting build is shown after the steps below.

  2. Install the environment with the following command:
    conda env create -f conda_env.yaml  # For Linux
    conda env create -f conda_env_win.yaml  # For Windows
  3. (Optional) A Dockerfile is provided for creating a Docker environment. The configuration may require updates, since it wasn't used in our experiments and was simply copied from the original repository.
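
To confirm that conda resolved the intended PyTorch and CUDA build (or a manually installed alternative), a minimal check like the sketch below can be run inside the activated environment; the expected versions reflect the provided environment files.

    # Sanity check (illustrative): verify which PyTorch / CUDA build the environment resolved.
    import torch

    print("PyTorch:", torch.__version__)        # expected 2.2.x
    print("CUDA available:", torch.cuda.is_available())
    print("CUDA build:", torch.version.cuda)    # expected 12.1 for the GPU build
    if torch.cuda.is_available():
        print("Device:", torch.cuda.get_device_name(0))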

Resources and Dataset

We used the Visually Aligned Sounds (VAS) dataset for all of our experiments due to its small size, which allowed for quicker training and testing. Training was initially done on a personal RTX 3060 GPU (6 GB VRAM), later scaled to a desktop with an RTX 3080 (12 GB VRAM), and finally to a SLURM cluster with A100 and V100 GPUs (40 and 80 GB VRAM).

Evaluation

We used the same metrics as the authors of SpecVQGAN to evaluate fidelity and relevance: Fréchet Inception Distance (FID) and Melception-based KL-divergence (MKL); lower is better for both.
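
As a rough illustration only (not the repository's evaluation code, which relies on the Melception-based tooling inherited from SpecVQGAN), the sketch below computes a Fréchet distance from two placeholder feature matrices and a KL divergence between two class-probability vectors:

    # Illustrative only: Fréchet distance over feature statistics and
    # KL divergence over class posteriors, using placeholder inputs.
    import numpy as np
    from scipy import linalg

    def frechet_distance(feats_real, feats_fake):
        """Fréchet distance between two (n_samples, n_features) feature matrices."""
        mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
        cov_r = np.cov(feats_real, rowvar=False)
        cov_f = np.cov(feats_fake, rowvar=False)
        covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
        if np.iscomplexobj(covmean):
            covmean = covmean.real  # drop tiny imaginary parts from numerical error
        diff = mu_r - mu_f
        return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

    def kl_divergence(p, q, eps=1e-12):
        """KL(p || q) for two class-probability vectors (e.g. classifier posteriors)."""
        p, q = p / p.sum(), q / q.sum()
        return float(np.sum(p * np.log((p + eps) / (q + eps))))

    rng = np.random.default_rng(0)
    real, fake = rng.normal(size=(512, 64)), rng.normal(size=(512, 64))
    print("Fréchet distance:", frechet_distance(real, fake))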

Results

The results and detailed explanations of our findings are available in the main GitHub README for the coursework. You can access it here.

Credits

This repository was forked from the official SpecVQGAN repository. The original paper, full documentation, and usage examples can be found here.

License

This project is licensed under the MIT License.
