# SPMNet

Visually Aligned Sound Generation via Sound-Producing Motion Parsing [paper]

## Overview

We propose to tame visually aligned sound generation by projecting the sound-producing motion onto a discriminative temporal visual embedding. This embedding distinguishes transient visual motion from complex background information, which leads to sounds with high temporal alignment. We refer to this model as SPMNet.
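As a rough illustration of the idea only (this is a minimal sketch, not the released implementation: the module name, dimensions, and layer choices below are all our own assumptions), the core step can be thought of as projecting per-frame visual features into a temporal embedding that emphasizes transient motion:

```python
import torch
import torch.nn as nn

class MotionParsingSketch(nn.Module):
    """Hypothetical sketch: project per-frame visual features to a
    discriminative temporal embedding. All dimensions and layers are
    illustrative assumptions, not the actual SPMNet architecture."""

    def __init__(self, feat_dim=2048, embed_dim=512):
        super().__init__()
        # Temporal convolution to capture frame-to-frame (transient) motion.
        self.temporal_conv = nn.Conv1d(feat_dim, embed_dim,
                                       kernel_size=3, padding=1)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, frame_feats):
        # frame_feats: (batch, time, feat_dim) per-frame visual features
        x = self.temporal_conv(frame_feats.transpose(1, 2))  # (B, embed, T)
        x = torch.relu(x).transpose(1, 2)                    # (B, T, embed)
        return self.proj(x)  # temporal visual embedding

# Usage: embed 16 frames of 2048-d features from a video backbone.
feats = torch.randn(2, 16, 2048)
embedding = MotionParsingSketch()(feats)
print(embedding.shape)  # torch.Size([2, 16, 512])
```

Such an embedding would then condition the sound decoder, so that generated audio events line up with the visual motion rather than with background content.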

## News

Code, pre-trained models, and all demos will be released here. Watch this repository for the latest updates.

## Demo

### Dog

dog_1.mp4
dog_6.mp4

### Drum

drum_1.mp4
drum_2.mp4

### Firework

firework_1.mp4
firework_2.mp4

Listen to the audio samples in our demo materials.

## Citation

Our paper was accepted by Neurocomputing. Please use this BibTeX entry if you would like to cite our work:

```bibtex
@article{Ma2022VisuallyAS,
  title={Visually Aligned Sound Generation via Sound-Producing Motion Parsing},
  author={Xin Ma and Wei Zhong and Long Ye and Qin Zhang},
  journal={Neurocomputing},
  year={2022}
}
```

## Acknowledgments

We acknowledge the following works:

- The code base is built upon the RegNet repo.
- Thanks to SpecVQGAN for its open-source efforts.