I set out to tackle a very subjective problem: generating album art based only on the audio of a song. This is obviously hard to do, and while the resulting model is by no means perfect, it is interesting to see what it has learned to generate conditioned on music.
The model is a latent diffusion model whose conditioning head has been replaced with the MERT-v1-95M Music2Vec model.
In the above diagram this means passing the MERT-v1-95M vector output for the song into the cross-attention layers of the denoising UNet, in place of the text embedding that would normally condition generation.
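For concreteness, here is a minimal sketch of how that conditioning path can be wired up with the Hugging Face transformers library. The checkpoint id `m-a-p/MERT-v1-95M` is the public release, but the file name and preprocessing details below are illustrative assumptions, not the exact pipeline used here.

```python
import torch
import torchaudio
from transformers import AutoModel, Wav2Vec2FeatureExtractor

# Load MERT-v1-95M (public checkpoint; requires trust_remote_code).
processor = Wav2Vec2FeatureExtractor.from_pretrained(
    "m-a-p/MERT-v1-95M", trust_remote_code=True)
mert = AutoModel.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True)

# "song.wav" is a placeholder; MERT expects 24 kHz mono audio.
waveform, sr = torchaudio.load("song.wav")
mono = torchaudio.functional.resample(
    waveform.mean(dim=0), sr, processor.sampling_rate)

inputs = processor(mono.numpy(), sampling_rate=processor.sampling_rate,
                   return_tensors="pt")
with torch.no_grad():
    # Shape (1, num_frames, 768): one embedding per audio frame.
    music_embeddings = mert(**inputs).last_hidden_state
```

Conveniently, MERT's 768-dim hidden states match the width the Stable Diffusion v1.x UNet expects from its CLIP text encoder, so the embedding sequence can be passed straight in as `encoder_hidden_states`.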
Here is an example of album art generated by the model: the top left is the original album art for the song, and the rest are novel covers generated by the model.
While music is obviously very subjective, there are hints here that the model has learned to match the general vibe of the song. Qualitatively, all of the images tend to be very calm, and the model has matched the color palette fairly accurately as well, save for the bottom right image.
Here is another example of the model picking up on cues in the song: both the original and some of the generated images have a very urban feel, with the top right literally being a brick wall and the bottom left appearing to contain at least the pattern of an automobile. Though these don't have the same fidelity as the images generated for the bluesouth song, they do illustrate the model's ability to derive some meaning from the music.
We fine-tuned the UNet of Stable Diffusion v1.4 and used the corresponding autoencoder. Despite the conditioning head being completely different, this surprisingly produced less abstract results than training the UNet from scratch.
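A rough sketch of that fine-tuning setup with the diffusers library is below. The checkpoint id, learning rate, and MSE noise-prediction loss are standard latent-diffusion training boilerplate, shown here as an assumed reconstruction rather than the exact training code.

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler

# Pretrained Stable Diffusion v1.4 components (assumed checkpoint id).
repo = "CompVis/stable-diffusion-v1-4"
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
scheduler = DDPMScheduler.from_pretrained(repo, subfolder="scheduler")

vae.requires_grad_(False)  # autoencoder stays frozen; only the UNet is tuned
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

def train_step(pixel_values, music_embeddings):
    """One denoising step. pixel_values: (B, 3, 512, 512) in [-1, 1];
    music_embeddings: (B, seq, 768) MERT output in place of text embeddings."""
    with torch.no_grad():
        latents = vae.encode(pixel_values).latent_dist.sample()
        latents = latents * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)
    # Predict the added noise, conditioned on the music embeddings.
    pred = unet(noisy, t, encoder_hidden_states=music_embeddings).sample
    loss = torch.nn.functional.mse_loss(pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Starting from the pretrained UNet means the model only has to learn to reinterpret its conditioning input, rather than relearn image structure from scratch, which is one plausible reason the results come out less abstract.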
- Release Weights
- Release How To