This repository contains the code used for fine-tuning the SpecVQGAN model for the coursework conducted at the National University of Kyiv-Mohyla Academy, titled Exploration of Multimodal Approaches in Image-to-Audio Synthesis. Our primary goal was to examine how several well-known deep learning observations transfer to the image-to-audio domain through a comparative analysis of parameter configurations. Full details on our coursework can be found here.
This repository supports both Linux and Windows setups using conda virtual environments. We used PyTorch 2.2 and CUDA 12.1 for GPU-accelerated training. The setup files include configurations for Linux and Windows, as well as an optional Docker environment.
To set up the environment:
- Choose the appropriate conda configuration file:
  - For Linux: `conda_env.yaml`
  - For Windows: `conda_env_win.yaml`

  Note: the `conda_env` files use PyTorch 2.2 with CUDA 12.1. If you're using a different version, or if you will be using ROCm or CPU for training, you might need to install PyTorch manually or modify the environment file according to the official PyTorch installation guide (see the sanity check after this list).
- Install the environment with the following command:

  ```bash
  conda env create -f conda_env.yaml      # For Linux
  conda env create -f conda_env_win.yaml  # For Windows
  ```
- (Optional) A Dockerfile is provided for creating a Docker environment. The configuration may require updates, since it wasn't used in our experiments and was simply copied from the original repository.
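After creating and activating the environment, a quick sanity check such as the one below (a minimal sketch added here for convenience; it is not part of the original repository) can confirm that the installed PyTorch build matches the expected CUDA 12.1 toolchain and can see a GPU:

```python
# Minimal environment sanity check (illustrative; not part of the original repo).
import torch

print("PyTorch version:", torch.__version__)   # expected: 2.2.x
print("CUDA build:", torch.version.cuda)       # expected: 12.1 (None for CPU/ROCm builds)
print("GPU available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```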
We used the Visually Aligned Sounds (VAS) dataset for all of our experiments due to its small size, which allowed for quicker training and testing. Training was initially done on a personal RTX 3060 (6 GB VRAM) GPU but later scaled to a desktop with an RTX 3080 (12 GB VRAM) and finally a SLURM cluster using A100 and V100 GPUs with 40 and 80 GB VRAM.
We used the same metrics as the authors of SpecVQGAN to evaluate fidelity and relevance: Fréchet Inception Distance (FID) and Melception-based KL-divergence (MKL); lower is better for both.
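For reference, FID compares the mean and covariance of embedding features extracted from real and generated samples (Melception features in the SpecVQGAN setup). The sketch below illustrates the general formula only and is not the evaluation code used in our experiments; the names `frechet_distance`, `feats_real`, and `feats_gen` are illustrative:

```python
# Sketch of the FID formula (illustrative, not the SpecVQGAN evaluation code):
# FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2 * (Sigma_r @ Sigma_g)^(1/2)),
# where (mu, Sigma) are the mean and covariance of features from real and
# generated samples.
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Compute FID between two feature matrices of shape (N, D)."""
    mu_r, sigma_r = feats_real.mean(axis=0), np.cov(feats_real, rowvar=False)
    mu_g, sigma_g = feats_gen.mean(axis=0), np.cov(feats_gen, rowvar=False)

    diff = mu_r - mu_g
    # Matrix square root of the covariance product; drop tiny imaginary parts
    # introduced by numerical error.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```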
The results and detailed explanations of our findings are available in the main GitHub README for the coursework. You can access it here.
This repository was forked from the official SpecVQGAN repository. The original paper, full documentation, and usage examples can be found here.
This project is licensed under the MIT License.